INTRODUCTION - This note considers the feature selection problem resulting
from the transformation $x = Bz$, where $B$ is a $k$ by $n$ matrix of rank $k$
and $k \le n$. Such a transformation can be considered to reduce the dimension
of each observation vector $z$, and in general such a transformation results
in a loss of "information". In terms of the divergence, this information
loss is expressed by the fact that the average divergence $D_B$ computed using
variable $x$ is less than or equal to the average divergence $D$ computed
using variable $z$. If $D_B = D$, then $B$ is said to be a sufficient statistic
for the average divergence $D$. If $B$ is a sufficient statistic for the
average divergence, then it can be shown that the probability of misclassification
computed using variable $x$ (of dimension $k \le n$) is equal to the probability
of misclassification computed using variable $z$.

In actual practice, $D_B$ can be somewhat less than $D$ and yet retain
enough information (as measured by the probability of misclassification). Although
the necessary ratio $D_B/D$ is problem dependent, empirical results seem to
indicate that this ratio lies in the range $.8 \le D_B/D \le 1$. The
global or absolute maximum value of $D_B$ over the class of all $k$ by $n$
matrices $B$ is a function of $k$. Let $D_B^*$ denote this global maximum. The
main purpose of this note is to develop an upper bound $\phi_k$ (a function of $k$)
which necessarily satisfies in general

$$D_B^* \le \phi_k .$$

It is shown that $\phi_k$ can be rather easily obtained for $1 \le k \le n$ by solving
for the eigenvalues of $m$ distinct $n$ by $n$ matrices, where $m$ is the
number of distinct classes. Thus only $mn$ distinct eigenvalues, obtained but
once, are adequate to determine $\phi_k$ for any $k \le n$. (If channel selection is
desired and $\phi_k$ is small, then more than $k$ channels should be selected to
process the data.)

Also included in this note is what is believed to be a new proof of the
well known fact that $D_B \le D$. Using the techniques necessary to prove the
above fact, it is shown that the Bhattacharyya distance as measured by
variable $x$ is less than or equal to the Bhattacharyya distance as measured by
variable $z$. Finally, upper and lower bounds on the Bhattacharyya distance as
measured by $x$ are derived. The expression for the gradient of the Bhattacharyya
distance with respect to the matrix $B$ is also derived. Although all the
Bhattacharyya results are for the two class problem, they can easily be extended
to the situation of $m$ distinct classes.

DISCUSSION 

We are interested in comparing n-dimensional information measures with 
k-dimensional information measures algebraically; that is by using various 
matrix operations. All the necessary algebraic relationships will be discussed 
and considered below. Also, these algebraic properties will be related to the 
interclass divergence (Reference 1) and the Bhattacharyya distance (Reference 2).
The following theorem from Reference 3 is essential to the discussion. 

Theorem 1 - Consider the sequence of symmetric matrices

$$A_r = (a_{ij}), \qquad i,j = 1,2,\ldots,r$$

for $r = 1,2,\ldots,n$. Let $\lambda_k(A_r)$ denote the k'th characteristic root of
$A_r$, where

$$\lambda_1(A_r) \ge \lambda_2(A_r) \ge \cdots \ge \lambda_r(A_r).$$

Then

$$\lambda_{k+1}(A_{i+1}) \le \lambda_k(A_i) \le \lambda_k(A_{i+1}).$$

The following corollary follows immediately from Theorem 1 and will be used
frequently.

Corollary 1 - $\lambda_{j+(n-k)}(A_n) \le \lambda_j(A_k) \le \lambda_j(A_n)$, $j = 1,\ldots,k$.

Lemma 1 - Let $A$ and $Q$ be real $n$ by $n$ square matrices where $QQ^T = I$
and $A$ is symmetric. Then if $\lambda$ and $x$ are an eigenvalue and corresponding
eigenvector of $A$, then $\lambda$ and $Qx$ are an eigenvalue with corresponding
eigenvector of $QAQ^T$.

Proof: $(QAQ^T)Qx = QA(Q^TQ)x$
$= QAx$
$= \lambda Qx$

Q.E.D.


We define:

$B$ : a real $k$ by $n$ matrix of rank $k \le n$.

$A$ : a real $n$ by $n$ symmetric positive definite matrix.

$S$ : an $n$ by $n$ symmetric matrix.

Define the function

$$\psi = \tfrac{1}{2}\,\mathrm{tr}\{(BAB^T)^{-1}(BSB^T)\}$$

where tr denotes the trace of a matrix. We use the notation
$\left(\frac{\partial \psi}{\partial B}\right)$ to denote the matrix whose i-j'th element is
$\partial \psi / \partial b_{ij}$, where $b_{ij}$ is the element in the i'th row and
j'th column of $B$. The following three Lemmas are proved
in Reference 2 and are included for completeness.

Lemma 2 - $\left(\frac{\partial \psi}{\partial B}\right)^T = [SB^T - AB^T(BAB^T)^{-1}(BSB^T)](BAB^T)^{-1}$

Lemma 3 - $B\left(\frac{\partial \psi}{\partial B}\right)^T = 0$

Lemma 4 - If $\hat{B} = QB$ where $Q$ is a $k$ by $k$ matrix of rank $k$, then

$$\left(\frac{\partial \psi}{\partial \hat{B}}\right)^T = \left(\frac{\partial \psi}{\partial B}\right)^T Q^{-1}$$

Remark: Lemma 3 shows that $\psi$, considered as a function of $B$, is invariant
under a non-singular transformation, and also that $\psi$ essentially
depends only on the subspace spanned by the row vectors of $B$.

The following theorem is proved in Reference 2.

Theorem 2: Given two real symmetric matrices $A$ and $S$ with $A$ positive
definite, there exists a nonsingular $n$ by $n$ matrix $R$ such that

$$RAR^T = I$$
$$RSR^T = D$$

where $I$ is the identity and $D$ is a diagonal matrix.

Remark: The elements of $D$ are the eigenvalues of $A^{-1}S$.

Theorem 3 - $\psi \le \tfrac{1}{2}\sum_{i=1}^{k}\lambda_i$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$ are the k-largest
eigenvalues of $A^{-1}S$. Thus $\psi$ is maximized by letting the row vectors of $B$
correspond to the eigenvectors associated with the k-largest eigenvalues of
$A^{-1}S$.


Proof: By Theorem 2, there exists a non-singular $n$ by $n$ matrix $R$ such
that $RAR^T = I$ and $RSR^T = D$, where the eigenvalues of $A^{-1}S$ are
the diagonal elements of $D$.

We assume $B$ is of the form $B = \hat{B}R$ where $\hat{B}$ is a $k$ by $n$ matrix
of rank $k$ (certainly this is no restriction, as evidenced if $\hat{B}$ is chosen
to be $BR^{-1}$). Then

$$\psi = \tfrac{1}{2}\,\mathrm{tr}\{(BAB^T)^{-1}(BSB^T)\}$$
$$= \tfrac{1}{2}\,\mathrm{tr}\{(\hat{B}RAR^T\hat{B}^T)^{-1}(\hat{B}RSR^T\hat{B}^T)\}$$
$$= \tfrac{1}{2}\,\mathrm{tr}\{(\hat{B}\hat{B}^T)^{-1}(\hat{B}D\hat{B}^T)\}$$

By Lemma 3, $\psi$ now depends only on the subspace spanned by the row
vectors of $\hat{B}$; thus we can assume $\hat{B}\hat{B}^T = I_k$ (the $k$ by $k$ identity) and
the problem becomes one of maximizing

$$\psi = \tfrac{1}{2}\,\mathrm{tr}\{\hat{B}D\hat{B}^T\}$$

subject to the constraint $\hat{B}\hat{B}^T = I_k$. But given $\hat{B}$ satisfying $\hat{B}\hat{B}^T = I_k$,
extend $\hat{B}$ to an orthogonal $n$ by $n$ matrix $Q$ whose first $k$ rows are the
rows of $\hat{B}$, where $QQ^T = I$. By Lemma 1, the eigenvalues of $QDQ^T$ are those of $D$. But
by Theorem 1, the $\ell$'th largest eigenvalue of $\hat{B}D\hat{B}^T$ is less than or equal
to the $\ell$'th largest eigenvalue of $QDQ^T$, i.e. $\lambda_\ell$. Thus,

$$\psi \le \tfrac{1}{2}\sum_{i=1}^{k}\lambda_i$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$
are the k-largest eigenvalues of $A^{-1}S$, with equality being obtained if
the rows of $B$ are chosen to correspond to the eigenvectors associated with
the k-largest eigenvalues of $A^{-1}S$.

QED
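The bound of Theorem 3 can be checked numerically. The sketch below is only an illustration under stated assumptions: the matrices are random test data, and the helper names `psi` and `generalized_eig` are hypothetical, not taken from the note. It computes the eigenvalues of $A^{-1}S$ through a Cholesky factor of $A$ (a choice made here for numerical convenience) and compares $\psi$ at random $B$ with half the sum of the k largest eigenvalues.

```python
import numpy as np

def psi(B, A, S):
    """psi(B) = 1/2 tr{(B A B^T)^{-1} (B S B^T)}."""
    return 0.5 * np.trace(np.linalg.solve(B @ A @ B.T, B @ S @ B.T))

def generalized_eig(A, S):
    """Eigenvalues (descending) and eigenvectors (columns) of A^{-1} S.

    Uses A = L L^T so that the symmetric solver eigh can be applied to
    L^{-1} S L^{-T}, which has the same eigenvalues as A^{-1} S.
    """
    L_inv = np.linalg.inv(np.linalg.cholesky(A))
    vals, U = np.linalg.eigh(L_inv @ S @ L_inv.T)   # ascending order
    return vals[::-1], (L_inv.T @ U)[:, ::-1]

# Random test data (an assumption of this sketch, not data from the note).
rng = np.random.default_rng(0)
n, k = 6, 2
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
N = rng.standard_normal((n, n))
S = N + N.T                          # symmetric

vals, V = generalized_eig(A, S)
bound = 0.5 * vals[:k].sum()         # upper bound of Theorem 3
B_opt = V[:, :k].T                   # rows: eigenvectors of the k largest eigenvalues
random_values = [psi(rng.standard_normal((k, n)), A, S) for _ in range(200)]
```

With these definitions, `psi(B_opt, A, S)` should agree with `bound` up to rounding, while every entry of `random_values` should lie at or below it.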


Corollary 1 - $\tfrac{1}{2}\sum_{j=1}^{k}\lambda_{j+(n-k)} \le \psi$, where $\lambda_1 \ge \cdots \ge \lambda_n$ are the
eigenvalues of $A^{-1}S$, and thus $\psi$ is bounded below by half the sum of the k smallest
eigenvalues of $A^{-1}S$.

Proof: Follows immediately from the proof of Theorem 3 and Corollary 1 of
Theorem 1.

Remark: In particular, note from Corollary 1 of Theorem 1, the smallest eigenvalue
of $A^{-1}S$ is less than or equal to the smallest eigenvalue of $(BAB^T)^{-1}(BSB^T)$, the
second smallest eigenvalue of $A^{-1}S$ is less than or equal to the second smallest
eigenvalue of $(BAB^T)^{-1}(BSB^T)$, etc.

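We use Theorem 3 to obtain a tighter upper bound on the so called average
divergence, defined by (Reference 4)

$$D_B = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D_B(i,j) = \tfrac{1}{2}\,\mathrm{tr}\Big\{\sum_{i=1}^{m}(BA_iB^T)^{-1}(BS_iB^T)\Big\} - \frac{m(m-1)}{2}\,k$$

where

$A_i$ : an $n$ by $n$ symmetric positive definite covariance matrix
for class i.

$\mu_i$ : n-dimensional mean vector for class i.

$\delta_{ij}$ : $\mu_i - \mu_j$

$m$ : number of distinct classes.

$S_i$ : $\sum_{j=1,\, j \ne i}^{m} (A_j + \delta_{ij}\delta_{ij}^T)$

$k$ : the number of rows of $B$.

Thus let

$$\lambda_{i,1} \ge \lambda_{i,2} \ge \cdots \ge \lambda_{i,k}$$

be the k largest eigenvalues of $A_i^{-1}S_i$, with $\lambda_{i,j}$ denoting the j'th largest
eigenvalue of $A_i^{-1}S_i$ in general. Then

Corollary 2:

$$\tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{k}\lambda_{i,j+n-k} - \frac{m(m-1)}{2}\,k \;\le\; D_B \;\le\; \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{k}\lambda_{i,j} - \frac{m(m-1)}{2}\,k$$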
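The two bounds of Corollary 2 need only the eigenvalues of the $m$ matrices $A_i^{-1}S_i$, computed once. The following sketch illustrates this under stated assumptions: it takes the average divergence to be the sum of the pairwise divergences (consistent with the formula above), uses random test classes, and the function names `eigs_desc`, `divergence_bounds` and `average_divergence` are hypothetical.

```python
import numpy as np

def eigs_desc(A, S):
    """Eigenvalues of A^{-1} S in descending order (A positive definite, S symmetric)."""
    L_inv = np.linalg.inv(np.linalg.cholesky(A))
    return np.linalg.eigvalsh(L_inv @ S @ L_inv.T)[::-1]

def divergence_bounds(covs, means, k):
    """Lower and upper bounds of Corollary 2 on the average divergence D_B."""
    m = len(covs)
    lower = upper = -0.5 * m * (m - 1) * k
    for i in range(m):
        S_i = sum(covs[j] + np.outer(means[i] - means[j], means[i] - means[j])
                  for j in range(m) if j != i)
        lam = eigs_desc(covs[i], S_i)
        upper += 0.5 * lam[:k].sum()     # k largest eigenvalues
        lower += 0.5 * lam[-k:].sum()    # k smallest eigenvalues
    return lower, upper

def average_divergence(covs, means, B):
    """D_B computed directly as the sum of the pairwise transformed divergences."""
    k, m, total = B.shape[0], len(covs), 0.0
    C = [B @ A @ B.T for A in covs]
    u = [B @ mu for mu in means]
    for i in range(m):
        for j in range(i + 1, m):
            d = u[i] - u[j]
            Ci_inv, Cj_inv = np.linalg.inv(C[i]), np.linalg.inv(C[j])
            total += (0.5 * np.trace(Ci_inv @ C[j] + Cj_inv @ C[i]) - k
                      + 0.5 * d @ (Ci_inv + Cj_inv) @ d)
    return total

# Random test classes (an assumption of this sketch).
rng = np.random.default_rng(1)
n, m, k = 5, 3, 2
covs = []
for _ in range(m):
    M = rng.standard_normal((n, n))
    covs.append(M @ M.T + n * np.eye(n))
means = [rng.standard_normal(n) for _ in range(m)]
lower, upper = divergence_bounds(covs, means, k)
samples = [average_divergence(covs, means, rng.standard_normal((k, n)))
           for _ in range(100)]
```

For $k = n$ the two bounds coincide with the untransformed average divergence, which gives a quick consistency check on the implementation.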


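It is shown in Reference 1 that $D_B \le D$. We now derive this result
algebraically. Clearly, by definition of $D_B$, it suffices to show

$$D_B(i,j) \le D(i,j)$$

where the interclass divergence between classes i and j is defined as

$$D(i,j) = \tfrac{1}{2}\,\mathrm{tr}\{A_i^{-1}A_j + A_j^{-1}A_i\} - n + \tfrac{1}{2}\,\mathrm{tr}\{[A_i^{-1} + A_j^{-1}]\,\delta_{ij}\delta_{ij}^T\}$$

and the transformed divergence $D_B(i,j)$ is defined as

$$D_B(i,j) = \tfrac{1}{2}\,\mathrm{tr}\{(BA_iB^T)^{-1}(BA_jB^T) + (BA_jB^T)^{-1}(BA_iB^T)\} - k + \tfrac{1}{2}\,\mathrm{tr}\{[(BA_iB^T)^{-1} + (BA_jB^T)^{-1}](B\delta_{ij}\delta_{ij}^TB^T)\}$$

Theorem 4 - $D(i,j) \ge D_B(i,j)$

Proof: By Theorem 3 (applied with $S = \delta_{ij}\delta_{ij}^T$ to bound the terms involving the
means), it suffices to show

$$\tfrac{1}{2}\,\mathrm{tr}\{A_i^{-1}A_j + A_j^{-1}A_i\} - \tfrac{1}{2}\,\mathrm{tr}\{(BA_iB^T)^{-1}(BA_jB^T) + (BA_jB^T)^{-1}(BA_iB^T)\} \ge n-k$$

Let $\lambda_1 \ge \cdots \ge \lambda_n > 0$ be the eigenvalues of $A_i^{-1}A_j$ and let $\gamma_1 \ge \cdots \ge \gamma_k > 0$
be the eigenvalues of $(BA_iB^T)^{-1}(BA_jB^T)$.

It suffices to show

$$\sum_{i=1}^{n}\Big(\lambda_i + \frac{1}{\lambda_i}\Big) - \sum_{i=1}^{k}\Big(\gamma_i + \frac{1}{\gamma_i}\Big) \ge 2(n-k)$$

First note that the function $f(x) = x + 1/x$ is greater or equal to 2
for $x > 0$, and that $f(1) = 2$ so that $f(x)$ is strictly decreasing in the
interval $(0,1]$ and strictly increasing in the interval $[1,\infty)$. Thus assume

$$\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_\ell \ge 1 \ge \gamma_{\ell+1} \ge \cdots \ge \gamma_k$$

and the proof follows by noting

$$\lambda_j + 1/\lambda_j \ge \gamma_j + 1/\gamma_j, \qquad j = 1,\ldots,\ell$$

$$\lambda_{n-j} + \frac{1}{\lambda_{n-j}} \ge \gamma_{k-j} + \frac{1}{\gamma_{k-j}}, \qquad j = 0,\ldots,k-(\ell+1)$$

$$\lambda_{n-j+(\ell+1)} + \frac{1}{\lambda_{n-j+(\ell+1)}} \ge 2, \qquad j = k+1,\ldots,n$$

Q.E.D.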
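A small numerical experiment makes Theorem 4 and the eigenvalue inequality used in its proof concrete. This is a minimal sketch on random test matrices; the function names (`divergence`, `transformed_divergence`, `f_sum`) are illustrative choices and not part of the note.

```python
import numpy as np

def divergence(A_i, A_j, delta):
    """Interclass divergence D(i,j) for two normal classes."""
    n = A_i.shape[0]
    Ai_inv, Aj_inv = np.linalg.inv(A_i), np.linalg.inv(A_j)
    return (0.5 * np.trace(Ai_inv @ A_j + Aj_inv @ A_i) - n
            + 0.5 * delta @ (Ai_inv + Aj_inv) @ delta)

def transformed_divergence(A_i, A_j, delta, B):
    """D_B(i,j): the same formula applied to B A B^T and B delta."""
    return divergence(B @ A_i @ B.T, B @ A_j @ B.T, B @ delta)

def f_sum(M_i, M_j):
    """Sum of f(x) = x + 1/x over the eigenvalues x of M_i^{-1} M_j."""
    x = np.linalg.eigvals(np.linalg.solve(M_i, M_j)).real
    return float(np.sum(x + 1.0 / x))

# Random test data (an assumption of this sketch).
rng = np.random.default_rng(2)
n, k = 6, 3
M1, M2 = rng.standard_normal((2, n, n))
A_i = M1 @ M1.T + n * np.eye(n)
A_j = M2 @ M2.T + n * np.eye(n)
delta = rng.standard_normal(n)

D = divergence(A_i, A_j, delta)
Bs = [rng.standard_normal((k, n)) for _ in range(200)]
D_B = [transformed_divergence(A_i, A_j, delta, B) for B in Bs]
# Left side of the inequality in the proof, to be compared with 2(n - k).
gaps = [f_sum(A_i, A_j) - f_sum(B @ A_i @ B.T, B @ A_j @ B.T) for B in Bs]
```

Every entry of `D_B` should be no larger than `D`, and every entry of `gaps` should be at least $2(n-k)$.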

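We now review briefly the concept of the square root of a positive
definite symmetric matrix $A$. Since $A$ is positive definite, it follows that

$$A = Q\,\mathrm{diag}(\lambda_1,\ldots,\lambda_n)\,Q^T$$

where $QQ^T = I$ and the $\lambda_i$ are the strictly positive eigenvalues of $A$.

Then, as in Reference 2, we define the matrix $A^{1/2}$ as

$$A^{1/2} = Q\,\mathrm{diag}(\lambda_1^{1/2},\ldots,\lambda_n^{1/2})\,Q^T$$

It is readily verified that $A^{1/2}A^{1/2} = A$ and also that $A^{1/2}$ is symmetric and
positive definite. Now, consistent with the previous notation, let $A_1$ and $A_2$ be $n$ by $n$
positive definite symmetric matrices.

Consider the ratio of the determinants

$$\Delta = \frac{|A_1 + A_2|}{|A_1|^{1/2}\,|A_2|^{1/2}}$$

It follows from the previous discussion of "square roots of a matrix" that

$$\Delta = |\Lambda + \Lambda^{-1}|$$

where $A_1^{-1/2}$ denotes the inverse of $A_1^{1/2}$ and $\Lambda$ is defined as

$$\Lambda = (A_1^{-1/2}\,A_2\,A_1^{-1/2})^{1/2}$$

Note that if $x$ is an eigenvector of $\Lambda$ with eigenvalue $\lambda$, then $x$ is
also an eigenvector of $\Lambda^{-1}$ with eigenvalue $1/\lambda$.
Thus if $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$ are the eigenvalues of $\Lambda$, it readily follows
that

$$\Delta = (\lambda_1 + 1/\lambda_1)(\lambda_2 + 1/\lambda_2)\cdots(\lambda_n + 1/\lambda_n) = \prod_{i=1}^{n}(\lambda_i + 1/\lambda_i)$$

Now if $B$ is a $k$ by $n$ matrix of rank $k$, we define

$$\Delta_B = \frac{|B(A_1 + A_2)B^T|}{|BA_1B^T|^{1/2}\,|BA_2B^T|^{1/2}} = \prod_{i=1}^{k}(\gamma_i + 1/\gamma_i)$$

where $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_k > 0$ are the eigenvalues of
$[(BA_1B^T)^{-1/2}(BA_2B^T)(BA_1B^T)^{-1/2}]^{1/2}$.
We prove

Theorem 5 - $2^{(n-k)}\,\Delta_B \le \Delta$

Proof: It is shown in the next theorem that $B\left(\frac{\partial \Delta_B}{\partial B}\right)^T = 0$. Thus we can
assume as in Theorem 3

$$B = \hat{B}R$$

where $RA_1R^T = I$ and $RA_2R^T = D$, where $D$ is a diagonal matrix with diagonal
elements corresponding to the eigenvalues of $A_1^{-1}A_2$. Then

$$\Delta_B = \frac{|\hat{B}\hat{B}^T + \hat{B}D\hat{B}^T|}{|\hat{B}\hat{B}^T|^{1/2}\,|\hat{B}D\hat{B}^T|^{1/2}}$$

and since by the initial remark $\Delta_B$ depends only on the subspace spanned
by the row vectors of $\hat{B}$, it suffices to consider only those $\hat{B}$ satisfying
$\hat{B}\hat{B}^T = I$. In this case

$$\Delta_B = |(\hat{B}D\hat{B}^T)^{1/2} + (\hat{B}D\hat{B}^T)^{-1/2}|$$

Thus if $\gamma_1^2 \ge \gamma_2^2 \ge \cdots \ge \gamma_k^2 > 0$ are the eigenvalues of $\hat{B}D\hat{B}^T$ and if
$\lambda_1^2 \ge \cdots \ge \lambda_n^2 > 0$ are the eigenvalues of $A_1^{-1}A_2$, it follows by definition
that $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_k$ are the eigenvalues of $(\hat{B}D\hat{B}^T)^{1/2}$ and that
$\lambda_1 \ge \cdots \ge \lambda_n$ are the eigenvalues of $\Lambda$. By Corollary 1 of Theorem 1 (as in
the proof of Theorem 3), $\lambda_{j+(n-k)} \le \gamma_j \le \lambda_j$ for $j = 1,\ldots,k$.

Thus as in Theorem 4, make the following association, with

$$\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_\ell \ge 1 \ge \gamma_{\ell+1} \ge \cdots \ge \gamma_k$$

$$\lambda_j + 1/\lambda_j \ge \gamma_j + 1/\gamma_j, \qquad j = 1,\ldots,\ell$$

$$\lambda_{n-j} + \frac{1}{\lambda_{n-j}} \ge \gamma_{k-j} + \frac{1}{\gamma_{k-j}}, \qquad j = 0,\ldots,k-(\ell+1)$$

$$\lambda_{n-j+(\ell+1)} + \frac{1}{\lambda_{n-j+(\ell+1)}} \ge 2, \qquad j = k+1,\ldots,n$$

In particular

$$\Delta_B \le \prod_{j=1}^{\ell}(\lambda_j + 1/\lambda_j)\prod_{j=0}^{k-(\ell+1)}(\lambda_{n-j} + 1/\lambda_{n-j})$$

and

$$2^{(n-k)}\,\Delta_B \le 2^{(n-k)}\prod_{j=1}^{\ell}(\lambda_j + 1/\lambda_j)\prod_{j=0}^{k-(\ell+1)}(\lambda_{n-j} + 1/\lambda_{n-j}) \le \Delta$$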


Now define the function

$$H(1,2) = \tfrac{1}{2}\ln\frac{\left|\tfrac{1}{2}(A_1 + A_2)\right|}{|A_1|^{1/2}\,|A_2|^{1/2}}$$

and

$$H_B(1,2) = \tfrac{1}{2}\ln\frac{\left|\tfrac{1}{2}B(A_1 + A_2)B^T\right|}{|BA_1B^T|^{1/2}\,|BA_2B^T|^{1/2}}$$

Then by Theorem 5 it is true that

$$H_B(1,2) \le H(1,2)$$
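The determinant inequalities can be illustrated numerically. The sketch below is written under stated assumptions: it evaluates the determinant ratio $|A_1 + A_2| / (|A_1|^{1/2}|A_2|^{1/2})$, compares it with the product of $(\lambda_i + 1/\lambda_i)$ where the $\lambda_i^2$ are the eigenvalues of $A_1^{-1}A_2$, and checks that reducing the dimension with a random $B$ cannot increase $H$. The matrices are random test data and the function names are hypothetical.

```python
import numpy as np

def logdet(X):
    """Log-determinant of a positive definite matrix."""
    return np.linalg.slogdet(X)[1]

def det_ratio(A1, A2):
    """|A1 + A2| / (|A1|^(1/2) |A2|^(1/2))."""
    return np.exp(logdet(A1 + A2) - 0.5 * logdet(A1) - 0.5 * logdet(A2))

def H(A1, A2):
    """H(1,2) = 1/2 ln( |(A1 + A2)/2| / (|A1|^(1/2) |A2|^(1/2)) )."""
    return 0.5 * (logdet(0.5 * (A1 + A2)) - 0.5 * logdet(A1) - 0.5 * logdet(A2))

def transformed(f, A1, A2, B):
    """Evaluate f on the k by k matrices B A1 B^T and B A2 B^T."""
    return f(B @ A1 @ B.T, B @ A2 @ B.T)

# Random test data (an assumption of this sketch).
rng = np.random.default_rng(3)
n, k = 6, 3
M1, M2 = rng.standard_normal((2, n, n))
A1 = M1 @ M1.T + n * np.eye(n)
A2 = M2 @ M2.T + np.eye(n)

# lambda_i: square roots of the (positive) eigenvalues of A1^{-1} A2.
lam = np.sqrt(np.linalg.eigvals(np.linalg.solve(A1, A2)).real)
Bs = [rng.standard_normal((k, n)) for _ in range(200)]
H_B_values = [transformed(H, A1, A2, B) for B in Bs]
ratio_B_values = [transformed(det_ratio, A1, A2, B) for B in Bs]
```

Here `np.prod(lam + 1 / lam)` should reproduce `det_ratio(A1, A2)`, each entry of `ratio_B_values` multiplied by $2^{n-k}$ should stay below `det_ratio(A1, A2)`, and each entry of `H_B_values` should stay below `H(A1, A2)`.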


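We use the notation $\left(\frac{\partial H_B(1,2)}{\partial B}\right)$ to denote the $k$ by $n$ matrix whose i-j'th
element is $\partial H_B(1,2)/\partial b_{ij}$, where $b_{ij}$ is the i-j'th element of $B$. Then

Lemma 5:

$$\left(\frac{\partial H_B(1,2)}{\partial B}\right)^T = (A_1 + A_2)B^T[B(A_1 + A_2)B^T]^{-1} - \tfrac{1}{2}[A_1B^T(BA_1B^T)^{-1} + A_2B^T(BA_2B^T)^{-1}]$$

so that $B\left(\frac{\partial H_B(1,2)}{\partial B}\right)^T = 0$.

Proof: If $dA$ denotes the matrix each element of which is the differential
of the corresponding element of the matrix $A$, then from Reference 2,

$$d \ln|A| = \mathrm{tr}\{A^{-1}\,dA\}$$

Now considering only the variation in $B$,

$$d \ln|BA_1B^T| = \mathrm{tr}\{(BA_1B^T)^{-1}(dB\,A_1B^T + BA_1\,dB^T)\} = 2\,\mathrm{tr}\{dB\,A_1B^T(BA_1B^T)^{-1}\}$$

so that

$$\left(\frac{\partial \ln|BA_1B^T|}{\partial B}\right)^T = 2A_1B^T(BA_1B^T)^{-1}$$

so that, applying the same identity to each of the three determinants in $H_B(1,2)$,

$$\left(\frac{\partial H_B(1,2)}{\partial B}\right)^T = (A_1 + A_2)B^T[B(A_1 + A_2)B^T]^{-1} - \tfrac{1}{2}[A_1B^T(BA_1B^T)^{-1} + A_2B^T(BA_2B^T)^{-1}]$$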
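The gradient expression of Lemma 5 can be compared with central finite differences. The following is a minimal sketch, assuming random positive definite test matrices; `H_B`, `grad_H_B` and `numerical_grad` are hypothetical helper names, and the step size `h` is an arbitrary choice.

```python
import numpy as np

def H_B(B, A1, A2):
    """H_B(1,2) = 1/2 ln( |B(A1+A2)B^T / 2| / (|B A1 B^T|^(1/2) |B A2 B^T|^(1/2)) )."""
    ld = lambda X: np.linalg.slogdet(X)[1]
    return 0.5 * (ld(B @ (0.5 * (A1 + A2)) @ B.T)
                  - 0.5 * ld(B @ A1 @ B.T) - 0.5 * ld(B @ A2 @ B.T))

def grad_H_B(B, A1, A2):
    """k by n matrix of partial derivatives dH_B/db_ij, from the Lemma 5 formula."""
    inv = np.linalg.inv
    A = A1 + A2
    grad_T = (A @ B.T @ inv(B @ A @ B.T)
              - 0.5 * (A1 @ B.T @ inv(B @ A1 @ B.T) + A2 @ B.T @ inv(B @ A2 @ B.T)))
    return grad_T.T

def numerical_grad(f, B, h=1e-6):
    """Central finite-difference approximation of the same k by n matrix."""
    G = np.zeros_like(B)
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            E = np.zeros_like(B)
            E[i, j] = h
            G[i, j] = (f(B + E) - f(B - E)) / (2 * h)
    return G

# Random test data (an assumption of this sketch).
rng = np.random.default_rng(7)
n, k = 5, 2
M1, M2 = rng.standard_normal((2, n, n))
A1 = M1 @ M1.T + n * np.eye(n)
A2 = M2 @ M2.T + n * np.eye(n)
B = rng.standard_normal((k, n))

analytic = grad_H_B(B, A1, A2)
numeric = numerical_grad(lambda X: H_B(X, A1, A2), B)
```

The two matrices should agree to within the finite-difference error, and `B @ analytic.T` should vanish up to rounding, which is the invariance property stated at the end of the lemma.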


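Lemma 6: Let the row vectors of $B$ correspond to k of the eigenvectors of
$A_1^{-1}A_2$. Then $\frac{\partial H_B(1,2)}{\partial B} = 0$.

Proof: We choose $B$ such that
$BA_1B^T = I$ and $BA_2B^T = D$ where $I$ is the identity and $D$ is a $k$ by $k$
diagonal matrix of k eigenvalues of $A_1^{-1}A_2$. The proof follows immediately
by noting that

$$A_2B^T = A_1B^TD$$

Remark: Let $\lambda_1^2 \ge \lambda_2^2 \ge \cdots \ge \lambda_\ell^2 \ge 1 \ge \lambda_{\ell+1}^2 \ge \cdots \ge \lambda_n^2$ be the eigenvalues of
$A_1^{-1}A_2$, and suppose that

$$\prod_{i=1}^{j}(\lambda_i + 1/\lambda_i)\prod_{i=0}^{k-j-1}(\lambda_{n-i} + 1/\lambda_{n-i})$$

maximizes the product of any k factors of the form $(\lambda_i + 1/\lambda_i)$; then by
Theorem 5 $H_B(1,2)$ attains a global maximum by choosing the row vectors of
$B$ to correspond to the eigenvectors of $A_1^{-1}A_2$ with eigenvalues

$$\lambda_i^2, \quad i = 1,\ldots,j \qquad \text{and} \qquad \lambda_{n-i}^2, \quad i = 0,\ldots,k-j-1,$$

with the maximum value of $H_B(1,2)$ given by

$$\tfrac{1}{2}\ln\Big[2^{-k}\prod_{i=1}^{j}(\lambda_i + 1/\lambda_i)\prod_{i=0}^{k-j-1}(\lambda_{n-i} + 1/\lambda_{n-i})\Big]$$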

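Using previous notation, we now define the interclass Bhattacharyya distance
for two multivariate normal distributions as

$$C = \tfrac{1}{8}\,\mathrm{tr}\Big\{\Big[\frac{A_1 + A_2}{2}\Big]^{-1}\delta_{12}\delta_{12}^T\Big\} + H(1,2)$$

and the transformed Bhattacharyya distance as

$$C_B = \tfrac{1}{8}\,\mathrm{tr}\Big\{\Big[\frac{B(A_1 + A_2)B^T}{2}\Big]^{-1}(B\delta_{12}\delta_{12}^TB^T)\Big\} + H_B(1,2)$$

Let $\gamma_1$ be the only non-zero eigenvalue of

$$\Big[\frac{A_1 + A_2}{2}\Big]^{-1}\delta_{12}\delta_{12}^T$$

Note that $\gamma_1 = \delta_{12}^T\Big[\frac{A_1 + A_2}{2}\Big]^{-1}\delta_{12}$ with corresponding eigenvector
$\Big[\frac{A_1 + A_2}{2}\Big]^{-1}\delta_{12}$.

Thus by Theorem 3 and the remark following Lemma 6, it follows

$$C_B \le \tfrac{1}{8}\,\delta_{12}^T\Big[\frac{A_1 + A_2}{2}\Big]^{-1}\delta_{12} + \max_B H_B(1,2)$$

We now prove

Theorem 6: Let $B$ be a $k$ by $n$ matrix of rank $k$ which extremizes $C_B$.
Then it is necessary $B$ satisfy an equation of the form

$$\left(\frac{\partial C_B}{\partial B}\right)^T = \tfrac{1}{2}\{\delta_{12}\delta_{12}^TB^T - (A_1+A_2)B^T[B(A_1+A_2)B^T]^{-1}(B\delta_{12}\delta_{12}^TB^T)\}[B(A_1+A_2)B^T]^{-1}$$
$$+\; (A_1+A_2)B^T[B(A_1+A_2)B^T]^{-1} - \tfrac{1}{2}[A_1B^T(BA_1B^T)^{-1} + A_2B^T(BA_2B^T)^{-1}]$$
$$= 0$$

where the last two terms are $\left(\frac{\partial H_B(1,2)}{\partial B}\right)^T$.

Proof: Immediate by Lemmas 2 and 5.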
INTRODUCTION - This note considers one particular aspect of the feature
selection problem, that resulting from the transformation x = Bz, where B
is a k by n matrix of rank k and k < n. Such a transformation can be
considered to reduce the dimension of each observation vector z. It is shown
that in general, such a transformation results in a loss of information. In
terms of the divergence, this is equivalent to the fact that the average
divergence computed using the variable x is less than or equal to the average
divergence computed using the variable z. Similarly, a loss of information
in terms of the probability of misclassification is shown to be equivalent to
the fact that the probability of misclassification computed using variable x
is greater than or equal to the probability of misclassification computed using
variable z.

First, the necessary facts relating k-dimensional and n-dimensional 
integrals are derived. Then the above mentioned results about the divergence 
and probability of misclassification are derived. Finally it is shown that if 
no information is lost (in x = Bz) as measured by the divergence, then no 
information is lost as measured by the probability of misclassification. 

The above results suggest that the increase in probability of misclassification 
resulting from the transformation x = Bz can be minimized by minimizing the 
information loss as measured by the average divergence. Thus the equations 
necessary to maximize the average divergence as a function of B are presented. 

It is shown that the information loss between each class pair, as measured by 
the divergence, can be conveniently displayed by a "Class Separability to be 
Gained Map". If this information loss is small enough for each distinct class 
pair, then there is essentially no increase in probability of misclassification 
resulting from the transformation x = Bz. 
FUNDAMENTAL LEMMAS

We are interested in relating integrals over k-dimensional regions to
integrals over n-dimensional regions. In particular, given some n-dimensional
space $\mathcal{Z}$ we are interested in comparing the divergence or probability of
misclassification computed in $\mathcal{Z}$ with the divergence or probability of
misclassification computed in $\mathcal{X}$, where $\mathcal{X}$ is any k-dimensional subspace of
$\mathcal{Z}$.

Consider the following:

$x = Bz$
$y = Sz$

such that

$$z' = \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} B \\ S \end{pmatrix} z = Qz$$

where

$Q$ : a real nonsingular n by n matrix
$B$ : a real k by n matrix
$S$ : a real (n-k) by n matrix, chosen such that the rows of $S$ are orthogonal
to the rows of $B$.
$z$ : a real n-dimensional vector
$x$ : a real k-dimensional vector
$y$ : a real (n-k)-dimensional vector

Script letters will denote a real vector space, so that

$\mathcal{Z} = \{z\}$ : a real n-dimensional vector space
$\mathcal{Z}' = \{z'\}$ : a real n-dimensional vector space
$\mathcal{X} = \{x\}$ : a real k-dimensional vector space
$\mathcal{Y} = \{y\}$ : a real (n-k)-dimensional vector space

The symbol $\otimes$ will denote Cartesian Product, so that

$$\mathcal{Z}' = \mathcal{X} \otimes \mathcal{Y}$$

Note that any non zero $z \in \mathcal{Z}$ can be expressed uniquely as

$$z = z_B + z_S$$

where

$$z_B = \sum_{i=1}^{k} a_i b_i, \qquad z_S = \sum_{j=k+1}^{n} a_j s_j$$

with the $b_i$ the rows of $B$ and the $s_j$ the rows of $S$,
and $BS^T = 0$ (and of course $SB^T = 0$) by choice of $S$.

Note that the condition $BS^T = 0$ implies

(i) $B(z) = B(z_B)$
(ii) $B(z_S) = 0$
(iii) $S(z) = S(z_S)$
(iv) $S(z_B) = 0$

Using the above definitions and notation, we prove

Lemma 1. If $R_k \subseteq B(\mathcal{Z})$ then

$$Q^{-1}(R_k \otimes S(\mathcal{Z})) = B^{-1}(R_k)$$

Proof: (1) Since $Q$ is non singular, it suffices to show

$$R_k \otimes S(\mathcal{Z}) = QB^{-1}(R_k)$$

(2) Let $z' \in R_k \otimes S(\mathcal{Z})$. Then from (i) - (iv) above,

$$z' = \begin{pmatrix} B(z_B) \\ S(z_S) \end{pmatrix} = Q(z_B + z_S)$$

for some $z_B$, $z_S$ with $B(z_B) \in R_k$.

(3) Since $B(z_B + z_S) = B(z_B) \in R_k$, we have $(z_B + z_S) \in B^{-1}(R_k)$
so that

$$R_k \otimes S(\mathcal{Z}) \subseteq QB^{-1}(R_k)$$

(4) Now let $z' \in QB^{-1}(R_k)$, so that there exists $z$, a member
of $B^{-1}(R_k)$, and $Q(z) = z'$.

(5) But $z = z_B + z_S$, and thus $B(z) = B(z_B) \in R_k$ so that

$$z' = Q(z) = \begin{pmatrix} B(z_B) \\ S(z_S) \end{pmatrix} \in R_k \otimes S(\mathcal{Z})$$

Thus $QB^{-1}(R_k) \subseteq R_k \otimes S(\mathcal{Z})$.

By (3) and (5), it follows $R_k \otimes S(\mathcal{Z}) = QB^{-1}(R_k)$.

Q.E.D.

Thus Lemma 1 relates k-dimensional regions $R_k \subseteq B(\mathcal{Z})$ with n-dimensional
regions $Q^{-1}(R_k \otimes S(\mathcal{Z}))$. It is convenient at this time to consider the following
density functions, all related, for fixed i, in a sense, by the transformations
$Q$ and $B$. Define:

$p_i(z)$ : the density function of the i'th class. We write $p_i(z) = N(\mu_i, \Sigma_i)$
to denote that the i'th class is normally distributed with mean $\mu_i$
and covariance $\Sigma_i$.

$f_i(z')$ : the transformed density function for the i'th class resulting from
the transformation $z' = Qz$. Thus $f_i(z') = N(Q\mu_i, Q\Sigma_iQ^T)$ and we will
use somewhat inconsistent notation in denoting $f_i(z')$ by $f_i(x,y)$
where $z' = \begin{pmatrix} x \\ y \end{pmatrix}$.

$g_i(x)$ : the transformed density function for the i'th class resulting from
the transformation $x = Bz$. Thus $g_i(x) = N(B\mu_i, B\Sigma_iB^T)$.

It is shown in Reference 1 that

$$g_i(x) = \int_{\mathcal{Y}} f_i(x,y)\,dy$$

so that $g_i(x)$ is the marginal density of $x$. This fact is expressed in
Reference 1 as:

THEOREM 2.4.3 - If $z'$ (a random variable) is distributed according to
$N(Q\mu_i, Q\Sigma_iQ^T)$, the marginal distribution of any set of components of $z'$ is
multivariate normal with means, variances, and covariances obtained by taking
the proper components of $Q\mu_i$ and $Q\Sigma_iQ^T$ respectively.

Note that since

$$Q\Sigma_iQ^T = \begin{pmatrix} B\Sigma_iB^T & B\Sigma_iS^T \\ S\Sigma_iB^T & S\Sigma_iS^T \end{pmatrix}$$

the proper component of $Q\Sigma_iQ^T$ is $B\Sigma_iB^T$, and the proper component of $Q\mu_i$
is $B\mu_i$.

LEMMA 2 - Let $R_k \subseteq B(\mathcal{Z})$. Then

$$\int_{R_k} g_i(x)\,dx = \int_{B^{-1}(R_k)} p_i(z)\,dz$$

Proof:

$$\int_{R_k} g_i(x)\,dx = \int_{R_k}\Big[\int_{\mathcal{Y}} f_i(x,y)\,dy\Big]dx \qquad \text{(by definition of } g_i(x))$$
$$= \int_{R_k \otimes S(\mathcal{Z})} f_i(z')\,dz' \qquad \text{(by definition of the integral)}$$
$$= \int_{Q^{-1}(R_k \otimes S(\mathcal{Z}))} p_i(z)\,dz \qquad \text{(by definition of } f_i(z') \text{ and } p_i(z))$$
$$= \int_{B^{-1}(R_k)} p_i(z)\,dz \qquad \text{(by LEMMA 1)}$$
SUFFICIENT STATISTICS AND THE PROBABILITY OF MISCLASSIFICATION

We assume the existence of m classes, each $N(\mu_i, \Sigma_i)$. Let the vector
spaces $\mathcal{Z}$, $\mathcal{Z}'$, $\mathcal{X}$ be as in the previous section. Using a maximum likelihood
classification procedure, it is possible to partition each of the above spaces
into disjoint sets, and thus compute the probability of misclassification.

Thus let

pmc : the probability of misclassification in $\mathcal{Z}$ resulting from a
maximum likelihood classification procedure.

pmc$_Q$ : the probability of misclassification in $\mathcal{Z}'$ resulting from a
maximum likelihood classification procedure.

pmc$_B$ : the probability of misclassification in $\mathcal{X}$ resulting from a
maximum likelihood classification procedure.

We are interested in comparing pmc, pmc$_Q$, and pmc$_B$. It will be shown
that

$$\mathrm{pmc}_B \ge \mathrm{pmc} = \mathrm{pmc}_Q$$

REMARK: If pmc$_B$ = pmc, then $B$ is said to be a sufficient statistic (for
the probability of misclassification).

It is convenient to define the following sets:

$$N_i(z) = \{z \mid p_i(z) > p_j(z),\ j = 1,\ldots,m \text{ and } j \ne i\}$$
$$N_i(z') = \{z' \mid f_i(z') > f_j(z'),\ j = 1,\ldots,m \text{ and } j \ne i\}$$
$$K_i(x) = \{x \mid g_i(x) > g_j(x),\ j = 1,\ldots,m \text{ and } j \ne i\}$$

written below as $N_i$, $N_i'$ and $K_i$ respectively.

Initially, consider the two class problem corresponding to the case m = 2,
and assume (to be true up to a set of measure zero) that

$$\mathcal{Z} = N_1 \cup N_2, \qquad \mathcal{Z}' = N_1' \cup N_2', \qquad \mathcal{X} = K_1 \cup K_2$$

Then by the definition of the probability of misclassification as
discussed above (Reference 1)

$$\mathrm{pmc} = \int_{N_2} p_1(z)\,dz + \int_{N_1} p_2(z)\,dz$$
$$\mathrm{pmc}_Q = \int_{N_2'} f_1(z')\,dz' + \int_{N_1'} f_2(z')\,dz'$$
$$\mathrm{pmc}_B = \int_{K_2} g_1(x)\,dx + \int_{K_1} g_2(x)\,dx$$

REMARK - We have omitted the a priori probabilities, as they will be assumed equal.
Moreover, it is shown in Reference 1 that if $\mathcal{Z} = M_1 \cup M_2$, $\mathcal{Z}' = M_1' \cup M_2'$,
and $\mathcal{X} = L_1 \cup L_2$, then

$$\mathrm{pmc} \le \int_{M_2} p_1(z)\,dz + \int_{M_1} p_2(z)\,dz$$
$$\mathrm{pmc}_B \le \int_{L_2} g_1(x)\,dx + \int_{L_1} g_2(x)\,dx$$

REMARK - Since $Q$ is nonsingular, it is easily verified that

$$\frac{p_i(z)}{p_j(z)} = \frac{f_i(z')}{f_j(z')} = \frac{f_i(Qz)}{f_j(Qz)}, \qquad i,j = 1,\ldots,m$$

so that the "likelihood ratio" is invariant under a non-singular transformation,
and thus $N_i' = Q(N_i)$, which results in

$$\mathrm{pmc} = \mathrm{pmc}_Q,$$

since for an arbitrary set $M$,

$$\int_{M} p_i(z)\,dz = \int_{Q(M)} f_i(z')\,dz'$$
THEOREM 1 - Assuming the existence of 2 distinct classes, then

$$\mathrm{pmc}_B \ge \mathrm{pmc} = \mathrm{pmc}_Q$$

with equality $\iff$ $B^{-1}(K_2) = N_2$ and $B^{-1}(K_1) = N_1$ a.e. (a.e. denotes almost
everywhere).

Proof:

$$\mathrm{pmc}_B = \int_{K_2} g_1(x)\,dx + \int_{K_1} g_2(x)\,dx$$
$$= \int_{B^{-1}(K_2)} p_1(z)\,dz + \int_{B^{-1}(K_1)} p_2(z)\,dz \qquad \text{(by Lemma 2)}$$
$$\ge \mathrm{pmc}$$

where the last inequality follows from the definition of pmc and the fact

$$B^{-1}(K_2) \cup B^{-1}(K_1) = B^{-1}(K_1 \cup K_2) = B^{-1}(\mathcal{X}) = \mathcal{Z}$$

It is immediate that pmc$_B$ $\ge$ pmc with equality $\iff$ $B^{-1}(K_2) = N_2$ and
$B^{-1}(K_1) = N_1$ a.e.

Q.E.D.

COROLLARY 1 - Assuming the existence of m distinct classes, then

$$\mathrm{pmc}_B \ge \mathrm{pmc} = \mathrm{pmc}_Q$$

with equality $\iff$ $B^{-1}(K_i) = N_i$; $i = 1,\ldots,m$, a.e.

Proof: Let $\mathcal{X} - K_i$ denote the set theoretical complement of $K_i$. Then (as
in Reference 1), by definition of pmc$_B$,

$$\mathrm{pmc}_B = \sum_{i=1}^{m}\int_{\mathcal{X} - K_i} g_i(x)\,dx$$
$$= \sum_{i=1}^{m}\int_{B^{-1}(\mathcal{X} - K_i)} p_i(z)\,dz$$
$$= \sum_{i=1}^{m}\int_{\mathcal{Z} - B^{-1}(K_i)} p_i(z)\,dz$$
$$\ge \sum_{i=1}^{m}\int_{\mathcal{Z} - N_i} p_i(z)\,dz = \mathrm{pmc}$$

Q.E.D.

REMARK - Note that $B^{-1}(K_i) = N_i$ is equivalent to

$$p_i(z) > p_j(z) \iff g_i(Bz) > g_j(Bz), \qquad j = 1,\ldots,m, \quad j \ne i, \quad \text{a.e.}$$

which is certainly implied whenever

$$\frac{p_i(z)}{p_j(z)} = \frac{g_i(Bz)}{g_j(Bz)}, \qquad j = 1,\ldots,m \quad \text{a.e.}$$

COROLLARY 2 - Assuming the existence of m distinct classes, then

$$\mathrm{pmc}_B \ge \mathrm{pmc}$$

with equality $\iff$ the following holds a.e.

$$p_i(z) > p_j(z) \iff g_i(Bz) > g_j(Bz), \qquad j = 1,\ldots,m, \quad j \ne i, \quad 1 \le i \le m$$
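The two-class statement can be illustrated in closed form in a special case. The sketch below assumes two normal classes with a common covariance matrix and equal a priori probabilities (a restriction introduced here for illustration, not one made in the note); under that assumption the maximum likelihood error is the standard normal distribution function evaluated at minus half the Mahalanobis distance between the means. The function name `pmc_equal_cov` and the test data are hypothetical.

```python
import math
import numpy as np

def pmc_equal_cov(cov, mu1, mu2, B=None):
    """Two-class maximum likelihood error for N(mu1, cov) and N(mu2, cov).

    Assumes equal priors and a common covariance. If B is given, the error
    is evaluated for the transformed variable x = Bz.
    """
    if B is not None:
        cov, mu1, mu2 = B @ cov @ B.T, B @ mu1, B @ mu2
    d = mu1 - mu2
    maha = math.sqrt(d @ np.linalg.solve(cov, d))       # Mahalanobis distance
    return 0.5 * math.erfc(maha / (2.0 * math.sqrt(2.0)))  # Phi(-maha / 2)

# Random test data (an assumption of this sketch).
rng = np.random.default_rng(4)
n, k = 5, 2
M = rng.standard_normal((n, n))
cov = M @ M.T + n * np.eye(n)
mu1, mu2 = rng.standard_normal((2, n))

pmc = pmc_equal_cov(cov, mu1, mu2)
pmc_B = [pmc_equal_cov(cov, mu1, mu2, rng.standard_normal((k, n)))
         for _ in range(200)]
# A 1 by n transformation that loses nothing in this special case.
B_suff = np.linalg.solve(cov, mu1 - mu2)[None, :]
```

In this setting every entry of `pmc_B` should be at least `pmc`, while `pmc_equal_cov(cov, mu1, mu2, B_suff)` should equal `pmc`, so a single well chosen linear feature already acts as a sufficient statistic.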

Lemma 2 and Corollary 2 suggest that in a sense (with respect to probability
of misclassification) we have never left the original space $\mathcal{Z}$. The
transformation $x = Bz$, combined with the $g_i(x)$ and the maximum likelihood
classification procedure, can be thought to define a decision function which
partitions the original space $\mathcal{Z}$ into disjoint sets. The transformation $B$,
in this sense, is used essentially to quicken the classification procedure.
Equivalently, the transformation $B$ can be considered as a rule which results
in the grouping together of points (vectors) in the space $\mathcal{Z}$. For example,
let $x \in K_i$, and define

$$S = \{z \mid z \in \mathcal{Z} \text{ and } Bz = x\}$$

so that members of the space $\mathcal{Z}$ are grouped together in the set $S$. Yet
associated with $S$ is only one particular class, namely that class into
which $x$ is classified using a given classification procedure (assumed to
take place in $\mathcal{X}$). Thus we can express Theorem 1 verbally by saying that in
general, the grouping together of vectors results in a loss of information.

The above discussion suggests the possibility of defining (conceptually)
general classification functions of the form

$$h_i(\phi(z)), \qquad i = 1,\ldots,m$$

where $\phi(z)$ is a vector, with $\phi$ not necessarily being a linear transformation.
Certainly, to be useful, such functions must possess the following properties:

(i) The class of functions $h_i(\phi(z))$ is more easily evaluated than the
class of functions $p_i(z)$, $i = 1,\ldots,m$.

(ii) The quantity $\varepsilon_{ij}$, an integral over $\mathcal{Z}$ weighted by $p_i(z)$ that compares
the ratio $p_i(z)/p_j(z)$ with the ratio $h_i(\phi(z))/h_j(\phi(z))$, is small for all i,j.

Note that the size of $\varepsilon_{ij}$ can be thought of as representing the information loss
between classes i and j resulting from the transformation $\phi(z)$. Certainly
$\varepsilon_{ij} = 0\ \forall i,j$ implies

$$\frac{p_i(z)}{p_j(z)} = \frac{h_i(\phi(z))}{h_j(\phi(z))}$$

Thus if a classification rule is defined by

$\phi(z)$ be classified into class i if and only if
$h_i(\phi(z)) > h_j(\phi(z))$, $j = 1,\ldots,m$, $j \ne i$, $1 \le i \le m$,

no information is lost by using the generalized classification functions
$h_i(\phi(z))$ whenever $\varepsilon_{ij} = 0\ \forall i,j$.
SUFFICIENT STATISTICS AND THE DIVERGENCE

We begin with the necessary definitions, with all notation consistent
with the previous two sections. Consider the existence of two distinct classes,
and define as in Reference 2 the mean information for discrimination in favor of
population one against population two (for a particular vector space) as

$$I(1,2) = \int_{\mathcal{Z}} p_1(z)\log\frac{p_1(z)}{p_2(z)}\,dz$$
$$I_Q(1,2) = \int_{\mathcal{Z}'} f_1(z')\log\frac{f_1(z')}{f_2(z')}\,dz'$$
$$I_B(1,2) = \int_{\mathcal{X}} g_1(x)\log\frac{g_1(x)}{g_2(x)}\,dx$$

Then the interclass divergence (again in a particular vector space) is defined
(Reference 2) as

$$D(1,2) = I(1,2) + I(2,1)$$
$$D_Q(1,2) = I_Q(1,2) + I_Q(2,1)$$
$$D_B(1,2) = I_B(1,2) + I_B(2,1)$$

We will show that

$$D_B(1,2) \le D(1,2) = D_Q(1,2),$$

with equality if and only if

$$\frac{p_1(z)}{p_2(z)} = \frac{g_1(Bz)}{g_2(Bz)} \quad \text{a.e.}$$

It follows immediately from Corollary 2 of Theorem 1 that $D_B(1,2) = D(1,2)$
implies that pmc$_B$ = pmc.

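To prove the desired inequality, it is necessary to state the following
theorem and corollary from Reference 2.

THEOREM 2 (KULLBACK): $I(1,2)$ is almost positive definite, i.e. $I(1,2) \ge 0$
with equality $\iff$ $p_1(z) = p_2(z)$ a.e.

COROLLARY 1 - For a region of integration $E$,

$$\int_E p_1(z)\log\frac{p_1(z)}{p_2(z)}\,dz \ge \Big(\int_E p_1(z)\,dz\Big)\log\frac{\int_E p_1(z)\,dz}{\int_E p_2(z)\,dz}$$

with equality iff

$$\frac{p_1(z)}{p_2(z)} = \frac{\int_E p_1(z)\,dz}{\int_E p_2(z)\,dz} \quad \text{a.e.}$$

REMARK: The above Theorem and Corollary also hold if $I_Q(1,2)$ or $I_B(1,2)$
and the corresponding density functions are considered.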


We now prove

THEOREM 3 - $I_Q(1,2) \ge I_B(1,2)$ with equality if and only if
$\frac{f_1(z')}{f_2(z')} = \frac{g_1(Bz)}{g_2(Bz)}$ a.e.
Thus in particular $I_Q(1,2) = I(1,2)$.

PROOF: (1)

$$I_Q(1,2) = \int_{\mathcal{Z}'} f_1(z')\log\frac{f_1(z')}{f_2(z')}\,dz' = \int_{\mathcal{X}}\Big(\int_{\mathcal{Y}} f_1(x,y)\log\frac{f_1(x,y)}{f_2(x,y)}\,dy\Big)dx$$

(2) It is shown in Reference 2 that Corollary 1 of Theorem 2 holds
for any pair of density functions. Thus, for fixed $x$, define

$$h_1(y) = \frac{f_1(x,y)}{\int_{\mathcal{Y}} f_1(x,y)\,dy} = \frac{f_1(x,y)}{g_1(x)}$$

and

$$h_2(y) = \frac{f_2(x,y)}{\int_{\mathcal{Y}} f_2(x,y)\,dy} = \frac{f_2(x,y)}{g_2(x)}$$

so that

$$\int_{\mathcal{Y}} h_1(y)\,dy = \int_{\mathcal{Y}} h_2(y)\,dy = 1$$

(3) It follows from the corollary that

$$\int_{\mathcal{Y}} h_1(y)\log\frac{h_1(y)}{h_2(y)}\,dy \ge \Big(\int_{\mathcal{Y}} h_1(y)\,dy\Big)\log\frac{\int_{\mathcal{Y}} h_1(y)\,dy}{\int_{\mathcal{Y}} h_2(y)\,dy} = 0$$

that is,

$$\int_{\mathcal{Y}} \frac{f_1(x,y)}{g_1(x)}\log\frac{f_1(x,y)\,g_2(x)}{f_2(x,y)\,g_1(x)}\,dy \ge 0$$

and for all $x$, we have

$$\int_{\mathcal{Y}} f_1(x,y)\log\frac{f_1(x,y)}{f_2(x,y)}\,dy \ge g_1(x)\log\frac{g_1(x)}{g_2(x)}$$

(4) Thus from (1) and above, we have

$$I_Q(1,2) \ge \int_{\mathcal{X}} g_1(x)\log\frac{g_1(x)}{g_2(x)}\,dx = I_B(1,2)$$

with equality if and only if

$$\frac{f_1(x,y)}{f_2(x,y)} = \frac{g_1(x)}{g_2(x)} \quad \text{a.e.}$$

Q.E.D.

COROLLARY 1 - $D_Q(1,2) = D(1,2) \ge D_B(1,2)$
with equality if and only if $\frac{p_1(z)}{p_2(z)} = \frac{g_1(Bz)}{g_2(Bz)}$ a.e.

Remark: If $D(1,2) = D_B(1,2)$, then $B$ is said to be a sufficient statistic
for the divergence.


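We now investigate the condition

$$\frac{p_1(z)}{p_2(z)} = \frac{g_1(Bz)}{g_2(Bz)}$$

Note that if $\Sigma_1$ is the covariance for the first class, then

$$Q\Sigma_1Q^T = \begin{pmatrix} B\Sigma_1B^T & B\Sigma_1S^T \\ S\Sigma_1B^T & S\Sigma_1S^T \end{pmatrix} = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}$$

where $C_{12} = C_{21}^T$. Similarly,

$$Q\Sigma_2Q^T = \begin{pmatrix} B\Sigma_2B^T & B\Sigma_2S^T \\ S\Sigma_2B^T & S\Sigma_2S^T \end{pmatrix} = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix}$$

where $D_{12} = D_{21}^T$. Letting $|Q\Sigma_1Q^T|$ denote the determinant, it follows

$$|Q\Sigma_1Q^T| = |C_{11}| \cdot |C_{22} - C_{21}C_{11}^{-1}C_{12}|$$
$$|Q\Sigma_2Q^T| = |D_{11}| \cdot |D_{22} - D_{21}D_{11}^{-1}D_{12}|$$

To see this, consider $Q\Sigma_1Q^T$ under the nonsingular transformation
$RQ\Sigma_1Q^TR^T$, where

$$R = \begin{pmatrix} I_k & 0 \\ -C_{21}C_{11}^{-1} & I_{n-k} \end{pmatrix}$$

so that $|R| = 1$ and

$$|Q\Sigma_1Q^T| = |RQ\Sigma_1Q^TR^T| = \left|\begin{pmatrix} C_{11} & 0 \\ 0 & C_{22} - C_{21}C_{11}^{-1}C_{12} \end{pmatrix}\right|$$

Also, since $RQ\Sigma_1Q^TR^T$ is positive definite, so is the symmetric matrix
$C_{22} - C_{21}C_{11}^{-1}C_{12}$.

Now define the positive definite matrices

$$C_{22\cdot 1} = C_{22} - C_{21}C_{11}^{-1}C_{12}$$
$$D_{22\cdot 1} = D_{22} - D_{21}D_{11}^{-1}D_{12}$$

so that

$$|Q\Sigma_1Q^T| = |C_{11}|\,|C_{22\cdot 1}|$$
$$|Q\Sigma_2Q^T| = |D_{11}|\,|D_{22\cdot 1}|$$

Now define the matrices $H_1$ and $H_2$ by

$$H_1 = \begin{pmatrix} C_{11}^{-1}C_{12}C_{22\cdot 1}^{-1}C_{21}C_{11}^{-1} & -C_{11}^{-1}C_{12}C_{22\cdot 1}^{-1} \\ -C_{22\cdot 1}^{-1}C_{21}C_{11}^{-1} & C_{22\cdot 1}^{-1} \end{pmatrix}$$

$$H_2 = \begin{pmatrix} D_{11}^{-1}D_{12}D_{22\cdot 1}^{-1}D_{21}D_{11}^{-1} & -D_{11}^{-1}D_{12}D_{22\cdot 1}^{-1} \\ -D_{22\cdot 1}^{-1}D_{21}D_{11}^{-1} & D_{22\cdot 1}^{-1} \end{pmatrix}$$

It is easily verified that

$$(Q\Sigma_1Q^T)^{-1} = \begin{pmatrix} C_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + H_1$$

and that

$$(Q\Sigma_2Q^T)^{-1} = \begin{pmatrix} D_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + H_2$$

Now let $\mu_i' = Q\mu_i$ and $\bar{\mu}_i = B\mu_i$, so that

$$(z' - \mu_1')^T(Q\Sigma_1Q^T)^{-1}(z' - \mu_1') = (x - \bar{\mu}_1)^TC_{11}^{-1}(x - \bar{\mu}_1) + (z' - \mu_1')^TH_1(z' - \mu_1')$$

and also

$$(z' - \mu_2')^T(Q\Sigma_2Q^T)^{-1}(z' - \mu_2') = (x - \bar{\mu}_2)^TD_{11}^{-1}(x - \bar{\mu}_2) + (z' - \mu_2')^TH_2(z' - \mu_2')$$

Now, by definition,

$$\frac{f_1(z')}{f_2(z')} = \frac{|Q\Sigma_2Q^T|^{1/2}\exp[-\tfrac{1}{2}(z'-\mu_1')^T(Q\Sigma_1Q^T)^{-1}(z'-\mu_1')]}{|Q\Sigma_1Q^T|^{1/2}\exp[-\tfrac{1}{2}(z'-\mu_2')^T(Q\Sigma_2Q^T)^{-1}(z'-\mu_2')]}$$

$$= \Big(\frac{|D_{11}|\,|D_{22\cdot 1}|}{|C_{11}|\,|C_{22\cdot 1}|}\Big)^{1/2}\frac{\exp[-\tfrac{1}{2}(x-\bar{\mu}_1)^TC_{11}^{-1}(x-\bar{\mu}_1)]\exp[-\tfrac{1}{2}(z'-\mu_1')^TH_1(z'-\mu_1')]}{\exp[-\tfrac{1}{2}(x-\bar{\mu}_2)^TD_{11}^{-1}(x-\bar{\mu}_2)]\exp[-\tfrac{1}{2}(z'-\mu_2')^TH_2(z'-\mu_2')]}$$

$$= \frac{g_1(x)}{g_2(x)}\Big(\frac{|D_{22\cdot 1}|}{|C_{22\cdot 1}|}\Big)^{1/2}\frac{\exp[-\tfrac{1}{2}(z'-\mu_1')^TH_1(z'-\mu_1')]}{\exp[-\tfrac{1}{2}(z'-\mu_2')^TH_2(z'-\mu_2')]}$$

Since $\frac{f_1(z')}{f_2(z')} = \frac{p_1(z)}{p_2(z)}$, it follows from Corollary 1 of Theorem 3:

THEOREM 4 - $D_B(1,2) = D(1,2)$ if and only if

$$\Big(\frac{|D_{22\cdot 1}|}{|C_{22\cdot 1}|}\Big)^{1/2}\frac{\exp[-\tfrac{1}{2}(z'-\mu_1')^TH_1(z'-\mu_1')]}{\exp[-\tfrac{1}{2}(z'-\mu_2')^TH_2(z'-\mu_2')]} = 1$$

for all $z' = Q(z)$.

Corollary 1 - $D_B(1,2) = D(1,2)$ if and only if $H_1 = H_2$ and $H_1Q(\mu_1 - \mu_2) = 0$.

Corollary 2 - If $\Sigma_1 = \Sigma_2$ and $B = \alpha^T$, then $D_B(1,2) = D(1,2)$, where $\alpha = \Sigma_1^{-1}(\mu_1 - \mu_2)$.

Proof: $H_1 = H_2$ since $\Sigma_1 = \Sigma_2$. The second condition follows by selecting each
row vector of $S$ orthogonal to $\mu_1 - \mu_2$, so that $C_{12} = D_{12} = 0$.

Q.E.D.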
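The equal-covariance case can be verified numerically. The sketch below is an illustration on random test data: it builds $B = \alpha^T$ with $\alpha = \Sigma^{-1}(\mu_1 - \mu_2)$, chooses the rows of $S$ orthogonal to $\mu_1 - \mu_2$ through a singular value decomposition (one convenient construction, assumed here), and checks that the off-diagonal block of $Q\Sigma Q^T$ vanishes, that $H_1Q(\mu_1 - \mu_2) = 0$, and that the one-dimensional divergence equals the n-dimensional one. Variable names are hypothetical.

```python
import numpy as np

def divergence(S1, S2, d):
    """Interclass divergence of N(m1, S1) and N(m2, S2) with d = m1 - m2."""
    n = S1.shape[0]
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    return (0.5 * np.trace(S1_inv @ S2 + S2_inv @ S1) - n
            + 0.5 * d @ (S1_inv + S2_inv) @ d)

# Random test data (an assumption of this sketch).
rng = np.random.default_rng(5)
n = 5
M = rng.standard_normal((n, n))
Sigma = M @ M.T + n * np.eye(n)            # common covariance of both classes
mu1, mu2 = rng.standard_normal((2, n))
delta = mu1 - mu2

alpha = np.linalg.solve(Sigma, delta)      # alpha = Sigma^{-1}(mu1 - mu2)
B = alpha[None, :]                         # 1 by n matrix B = alpha^T
S = np.linalg.svd(delta[None, :])[2][1:]   # (n-1) by n, rows orthogonal to delta
Q = np.vstack([B, S])

C = Q @ Sigma @ Q.T                        # blocks C_11, C_12, C_21, C_22
pad = np.zeros((n, n))
pad[0, 0] = 1.0 / C[0, 0]                  # the block diag(C_11^{-1}, 0)
H1 = np.linalg.inv(C) - pad                # (Q Sigma Q^T)^{-1} minus that block

D_full = divergence(Sigma, Sigma, delta)
D_B = divergence(B @ Sigma @ B.T, B @ Sigma @ B.T, B @ delta)
```

With a common covariance both quantities reduce to $(\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2)$, so `D_B` and `D_full` should agree up to rounding.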
REMARK - Theorem 3 reveals the importance of the equality:

$$\frac{f_1(z')}{f_2(z')} = \frac{g_1(Bz)}{g_2(Bz)}$$

We note the following Lemma, proved initially by Halmos:

LEMMA 3 - If $g$ is a real-valued function on $\mathcal{X}$ then

$$\int_{\mathcal{X}} g(x)\,g_i(x)\,dx = \int_{\mathcal{Z}} g(Bz)\,p_i(z)\,dz, \qquad i = 1,2$$

Using Lemma 3, it is easily verified that

$$D(1,2) - D_B(1,2) = \int_{\mathcal{Z}}\Big[p_1(z)\log\frac{p_1(z)\,g_2(Bz)}{p_2(z)\,g_1(Bz)} + p_2(z)\log\frac{p_2(z)\,g_1(Bz)}{p_1(z)\,g_2(Bz)}\Big]dz$$

We now prove

LEMMA 4 - $\int_{\mathcal{Z}}\{g_1(Bz)\,p_2(z) - g_2(Bz)\,p_1(z)\}\,dz = 0$

Proof:

$$\int_{\mathcal{Z}} g_1(Bz)\,p_2(z)\,dz = \int_{\mathcal{X}} g_1(x)\,g_2(x)\,dx$$
$$= \int_{\mathcal{X}} g_2(x)\,g_1(x)\,dx$$
$$= \int_{\mathcal{Z}} g_2(Bz)\,p_1(z)\,dz$$

Q.E.D.
THE AVERAGE DIVERGENCE


The interclass divergence is a measure of the degree of difficulty of 
discriminating between two classes or populations. However, the general 
feature selection-classification problem involves measuring the separation 
between m-classes. This section presents the average divergence of m-classes 
as a natural generalization of the interclass divergence. The average 
divergence is shown to be a measure of the separation between m-classes. 

Finally, the average divergence is related to the probability of misclassificatioon. 

We assume three distinct classes, normally distributed, although the
generalization to m distinct classes is immediate. Following a procedure
similar to that of Reference 2 for the interclass divergence, define:


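$$P(H_i \mid z) = \frac{q_i\,p_i(z)}{q_1 p_1(z) + q_2 p_2(z) + q_3 p_3(z)}, \qquad i = 1, 2, 3,$$

where $q_i$ is the a priori probability of z belonging to class i. Thus it
follows:

$$\log\frac{p_1(z)}{p_2(z)} = \log\frac{P(H_1 \mid z)}{P(H_2 \mid z)} - \log\frac{q_1}{q_2}$$

$$\log\frac{p_1(z)}{p_3(z)} = \log\frac{P(H_1 \mid z)}{P(H_3 \mid z)} - \log\frac{q_1}{q_3}$$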


Now define the functions:

$$s_1(z) = \log\frac{p_1(z)}{p_2(z)} + \log\frac{p_1(z)}{p_3(z)} = \log\frac{p_1^2(z)}{p_2(z)\,p_3(z)}$$

$$s_2(z) = \log\frac{p_2^2(z)}{p_1(z)\,p_3(z)}$$

$$s_3(z) = \log\frac{p_3^2(z)}{p_1(z)\,p_2(z)}$$
It is easily verified that $s_j(z) = \max\{s_1(z), s_2(z), s_3(z)\}$ if and only if
$p_j(z) = \max\{p_1(z), p_2(z), p_3(z)\}$. Thus $s_j(z) = \max_i s_i(z)$, i = 1 to 3, implies it
is more likely z belongs to class j. We define $s_1(z)$ as the information
in z for discrimination in favor of class 1 against class 2 or 3.
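The equivalence between the largest $s_j(z)$ and the largest $p_j(z)$ can be illustrated with a minimal sketch. The density values below are hypothetical numbers chosen only for illustration, and the helper names are not from the original report. The sketch uses the identity $s_j = m \log p_j - \sum_i \log p_i$, which for m = 3 reduces to the definitions above.

```python
import math

def s_scores(p):
    """Return [s_1, ..., s_m] where s_j = sum over i != j of log(p_j / p_i)."""
    logs = [math.log(v) for v in p]
    total = sum(logs)
    m = len(p)
    # s_j = (m - 1) * log p_j - sum_{i != j} log p_i = m * log p_j - total
    return [m * lj - total for lj in logs]

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical values of p_1(z), p_2(z), p_3(z) at one observation z.
p = [0.02, 0.30, 0.11]
s = s_scores(p)
best_by_s = argmax(s)
best_by_p = argmax(p)
```

Because every $s_j$ is an increasing function of $p_j$ minus a term common to all classes, the two indices always agree.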


The mean information for discrimination in favor of class 1 against class 2 or
3 as measured by class 1 is

$$I(1:2) + I(1:3) = \int p_1(z)\,s_1(z)\,dz$$

Similarly, the mean information for discrimination in favor of class 2 against
class 1 or class 3 as measured by class 2 is

$$I(2:1) + I(2:3) = \int p_2(z)\,s_2(z)\,dz$$

Finally, the mean information for discrimination in favor of class 3 against
class 1 or class 2 as measured by class 3 is

$$I(3:1) + I(3:2) = \int p_3(z)\,s_3(z)\,dz$$

Thus we define the average divergence D as

D = I(1:2) + I(1:3) + I(2:1) + I(2:3) + I(3:1) + I(3:2)

  = [I(1:2) + I(2:1)] + [I(1:3) + I(3:1)] + [I(2:3) + I(3:2)]
  = D(1,2) + D(1,3) + D(2,3)
where D(i,j) is the interclass divergence between classes i and j. In
general, for m distinct classes,

$$D = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D(i,j).$$

Thus the average divergence D is a measure of the total divergence between
the classes 1 through m, and as such is a measure of the difficulty of
discriminating between them.

Using the notation of the previous section, it follows that the k-dimensional
B-average divergence resulting from the transformation x = Bz is

$$D_B = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D_B(i,j)$$


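We now prove

THEOREM 5 - $D = D_B \Rightarrow pmc = pmc_B$

Proof: (1) Assume $D = D_B$. By Corollary 1 of Theorem 3, $D(i,j) \ge D_B(i,j)$ for all i, j.
Since $D - D_B$ is the sum of the nonnegative terms $D(i,j) - D_B(i,j)$, it must be
true that $D(i,j) = D_B(i,j)$ for all i, j.

(2) By Corollary 1 of Theorem 3,

$$D(i,j) = D_B(i,j) \iff \frac{p_i(z)}{p_j(z)} = \frac{g_i(Bz)}{g_j(Bz)} \quad \text{a.e., for all } i, j.$$

(3) By Corollary 2 of Theorem 1,

$pmc = pmc_B$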
FEATURE SELECTION - AN EXAMPLE FROM THE C1 FLIGHT LINE.

Theorems 3 and 5 suggest that a possible feature selection criterion is
the B-average divergence $D_B$, since the difference $D - D_B$ is a
measure of the information lost in performing the transformation x = Bz.
Moreover, Theorem 5 suggests that the difference $D - D_B$ is a measure of the
difference of two classification maps (for the same field): one generated
using maximum likelihood classification on the $p_i(z)$, the other using
maximum likelihood classification on the $g_i(Bz)$. By Theorem 5, the two
classification maps will be the same if $D = D_B$. Also, by Theorem 1, the
classification map generated using $p_i(z)$ is the best classification map
possible (with respect to probability of misclassification), so it makes
sense to try to make the classification map generated by the $g_i(Bz)$ agree
with that map generated by the $p_i(z)$. Thus our feature selection criterion
is stated simply as

$$\max_B D_B$$


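where B is a k by n matrix of rank k. If the m classes are normally
distributed with means $\mu_i$ and covariances $\Lambda_i$, then it is shown in Reference
3 that

$$D_B = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D_B(i,j) = \frac{1}{2}\,\mathrm{tr}\Big\{\sum_{i=1}^{m}\big[(B\Lambda_i B^T)^{-1}(B S_i B^T)\big]\Big\} - \frac{m(m-1)}{2}\,k$$

where

$$S_i = \sum_{j=1,\; j \ne i}^{m} \big[\Lambda_j + \delta_{ij}\delta_{ij}^T\big], \qquad \delta_{ij} = \mu_i - \mu_j.$$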
Let $\left(\frac{\partial D_B}{\partial B}\right)$ denote the matrix whose i-j th element is $\frac{\partial D_B}{\partial b_{ij}}$, where $b_{ij}$ is
the i-j th element of B. Then it is shown in Reference 3 that

$$\left(\frac{\partial D_B}{\partial B}\right)^T = \sum_{i=1}^{m}\big[S_i B^T - \Lambda_i B^T (B\Lambda_i B^T)^{-1}(B S_i B^T)\big](B\Lambda_i B^T)^{-1}$$

Using the above expressions for $D_B$ and $\left(\frac{\partial D_B}{\partial B}\right)^T$, it is possible to
maximize $D_B$ using any of the many existing optimization algorithms. One
can graphically display "separability" using what we will call a "Class
Separability to be Gained Map" (Reference 5). Consider a coordinate system
whose ordinate (for a given value of k) is $D_B(i,j)$, where now B is assumed
to maximize $D_B$. The abscissa is the value of D(i,j) in the original
space and, for a given i-j pair, represents the separability between classes
i and j. Since $D(i,j) \ge D_B(i,j)$, the distance of a given point from the
diagonal line $D(i,j) = D_B(i,j)$ represents the separability to be gained for
that class pair. Thus for a given class pair, its location along the abscissa
is fixed, and as k increases, the point corresponding to that class pair can
only move vertically toward the diagonal boundary. Obviously, for large
enough k, all the points will lie on the diagonal boundary.
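The expressions for $D_B$ and its gradient can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `avg_divergence_B` and the random test data are illustrative assumptions, not part of Reference 3. It evaluates the trace formula, evaluates the gradient formula, and compares the gradient with a central finite difference.

```python
import numpy as np

def avg_divergence_B(B, covs, means):
    """Return (D_B, dD_B/dB) for normal classes, using the trace formula above."""
    m, k = len(covs), B.shape[0]
    total = 0.0
    grad_T = np.zeros((B.shape[1], k))          # accumulates (dD_B/dB)^T, n by k
    for i in range(m):
        S_i = np.zeros_like(covs[i])
        for j in range(m):
            if j != i:
                d = (means[i] - means[j]).reshape(-1, 1)
                S_i += covs[j] + d @ d.T        # Lambda_j + delta_ij delta_ij^T
        BLB_inv = np.linalg.inv(B @ covs[i] @ B.T)
        BSB = B @ S_i @ B.T
        total += 0.5 * np.trace(BLB_inv @ BSB)
        grad_T += (S_i @ B.T - covs[i] @ B.T @ BLB_inv @ BSB) @ BLB_inv
    return total - m * (m - 1) * k / 2.0, grad_T.T

# Hypothetical test problem: m = 3 classes in n = 4 dimensions, k = 2.
rng = np.random.default_rng(0)
n, k, m = 4, 2, 3
means = [rng.normal(size=n) for _ in range(m)]
covs = []
for _ in range(m):
    G = rng.normal(size=(n, n))
    covs.append(G @ G.T + n * np.eye(n))        # positive definite by construction
B = rng.normal(size=(k, n))

D_B, grad = avg_divergence_B(B, covs, means)

# Central finite-difference approximation of the gradient, entry by entry.
h = 1e-6
numeric = np.zeros_like(B)
for r in range(k):
    for c in range(n):
        E = np.zeros_like(B)
        E[r, c] = h
        plus = avg_divergence_B(B + E, covs, means)[0]
        minus = avg_divergence_B(B - E, covs, means)[0]
        numeric[r, c] = (plus - minus) / (2 * h)
```

Any gradient-based optimizer could then be driven by the pair `(D_B, grad)`; the choice of optimizer is left open, as in the text.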
SYMAT, COVAR - TEST PROCEDURES FOR MATRIX CALCULATIONS

W. L. Morris 
University of Houston 


The following is a description of the FORTRAN subroutine SYMAT and 
related FORTRAN subroutines. This description is intended to supplement the 
comment statements that appear in the accompanying FORTRAN program listing. 
Included in this listing is a DEMO PROGRAM in which various applications of 
subroutine SYMAT are illustrated by particular examples. 

Subroutine SYMAT operates on a real symmetric matrix A(N,N) and
produces an orthogonal matrix W(N,N) of approximate eigenvectors of A 
along with two vectors C(N) and R(N). The components of C are approxi- 
mate eigenvalues of A and the components of R are absolute error bounds 
for the approximate eigenvalues. For example, if for some index I the 
values of C(I) and R(I) are 10.0 and 0.0001 respectively then there 
is an eigenvalue of A in the interval (9.9999,10.0001), or, equivalently, 
the maximum relative error in C(I) is R(I)/C(I) which in this case is 
0.00001, that is, C(I) is correct to within one part in 100,000. The unit 
eigenvector associated with C(I) is the Ith column of W. In the output of 
SYMAT the entries in C are ordered with C(1) the largest and C(N) the
smallest in absolute value. The entries in R as well as the columns of W 
are arranged to correspond with the indexing of C. 

Another input parameter in SYMAT, denoted by REL, allows the user to 
specify a desired relative error in the approximate eigenvalues of A. The 
actual relative errors produced by SYMAT are a function of the matrix A and 
the word length of the computer in which SYMAT is executed. The best relative 
errors are produced by assigning to REL the value of zero. When executed on 
an IBM-360 using single word (four byte) arithmetic the smallest values of the
relative errors that can be expected consistently are on the order of 
0.000005, but this could be improved by executing SYMAT in a computer with a 
longer word length or by coding SYMAT to operate in double word arithmetic. 
The theoretical basis for SYMAT is presented in the reference:

W. L. Morris, Inclusion theorems for a section of a matrix, 

Numer. Math. 18 (1972) , 457-464. 

In essence SYMAT is an iterative algorithm in which the problem of finding 
eigenvalues and eigenvectors of a real symmetric matrix is transformed into 
an equivalent problem of finding eigenvalues and eigenvectors of an infinite 
sequence of matrices of order two. Within SYMAT it is important that rounding 
errors be carefully controlled, especially in computing inner products of 
vectors. For this reason function SUPSUM is used to add the components of a 
vector which are ordered by subroutine ORDER. These subroutines are used 
within subroutine MATMUL which computes matrix products. In addition to being 
used with SYMAT, each of the above subroutines can be used in other applications. 
The remaining subroutine called by SYMAT is subroutine MINDEX which is used to 
select the order of operations within SYMAT. 

The DEMO PROGRAM also contains a subroutine COVAR which uses subroutine 
MATMUL to compute the covariance matrix (denoted by A) of a data matrix 
(denoted by X). Since a covariance matrix is symmetric it can be analyzed
by using subroutine SYMAT. Also the DEMO PROGRAM displays the following 
applications of the output of subroutine SYMAT: 

1. an approximate inverse of A is computed; 

2. a condition number of A is computed; 

3. an approximate determinant of A is computed along with a bound 

for the absolute error in the computed det(A); and 
4. the row norm of $W^T W - I$ is computed.

These four items are computed in a straightforward way. If W is an orthogonal
matrix of eigenvectors of A and D is a diagonal matrix of (properly ordered)
eigenvalues of A then AW = WD so that $A^{-1} = W D^{-1} W^T$. The spectral condition
number of A is the ratio of the largest to the smallest eigenvalue of A.
The magnitude of the condition number indicates the quality of the computed
inverse of A. The determinant of A is the product of the eigenvalues of
A so that the approximate eigenvalues, C(I), along with the error bounds,
R(I), can be used to compute det(A) and its associated error bound. Finally,
since W is orthogonal the row norm of $W^T W - I$ is computed and indicates
the quality of the computed eigenvectors of A.
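The four applications listed above can be sketched as follows. This is not the SYMAT algorithm: it substitutes NumPy's `numpy.linalg.eigh` for the eigenvalue step, it does not produce the error bounds R(I), and the data matrix is a hypothetical random example. The row norm is taken here to be the maximum absolute row sum, which is an assumption about the norm intended in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))              # hypothetical data matrix, rows are observations
A = np.cov(X, rowvar=False)               # symmetric covariance matrix of X

c, W = np.linalg.eigh(A)                  # eigenvalues (ascending) and orthonormal eigenvectors
order = np.argsort(-np.abs(c))            # reorder: largest absolute value first
c, W = c[order], W[:, order]

A_inv = W @ np.diag(1.0 / c) @ W.T        # 1. approximate inverse, A^-1 = W D^-1 W^T
cond = abs(c[0]) / abs(c[-1])             # 2. spectral condition number
det_A = np.prod(c)                        # 3. determinant as the product of eigenvalues
orth_err = np.abs(W.T @ W - np.eye(len(c))).sum(axis=1).max()  # 4. row norm of W^T W - I
```

A small `orth_err` indicates nearly orthonormal computed eigenvectors, and a large `cond` warns that `A_inv` may be inaccurate.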
ABSTRACT


The Nearest Neighbor (NN) rule is nonparametric,
or distribution free, in the sense that it does not depend 
on any assumptions about the underlying statistics for its 
application. The k-NN rule is a procedure that assigns 
an observation vector z to a category F if most of the k 
nearby observations are elements of F. The Condensed 
Nearest Neighbor (CNN) rule may be used to reduce the 
size of the training set required to correctly categorize 
all the elements of the training set. 

The Bayes risk serves merely as a reference: the limit
of excellence beyond which it is not possible to go. The
risk of the NN rule is bounded below by the Bayes risk and
above by twice the Bayes risk.



Let us begin with a brief explanation of the discrimination
problem. For convenience let us consider the
two population case. Let $x_1, x_2, \ldots, x_m$ be samples from the
q-variate distribution F; $y_1, y_2, \ldots, y_n$ samples from the
q-variate distribution G, and z be an observation vector
such that z is an element of the union of F and G. The
problem is to decide whether z is an element of F or of G.
In [1] the discrimination problem is classified in three
categories:

1.) F and G are completely known.

2.) F and G are known except for the values of one
    or more parameters.

3.) F and G are completely unknown, except possibly
    for assumptions about existence of densities, etc.

In this paper we will concern ourselves with the solution
of category three of the discrimination problem by
means of the minimum distance classifier, commonly referred
to as the nearest neighbor (NN) rule. Fix and Hodges [1]
and [2] investigated the $k_n$-nearest neighbor rule. It
assigns to an unclassified observation vector the classification
most heavily represented among its $k_n$ nearest
neighbors from a previously classified set of points.
They established the consistency of this rule for sequences
$k_n \to \infty$ in such a manner that $k_n/n \to 0$ as $n \to \infty$. In [3]
T. M. Cover and P. E. Hart showed that for any number
n of samples the single-NN rule ($k_n = 1$) has a strictly
lower probability of error than any other $k_n$-NN rule in
those distributions for which simple decision boundaries
provide complete separation of the samples into their
respective categories. In [4] P. E. Hart proposes the
use of the Condensed Nearest Neighbor rule (CNN) which
retains the basic approach of the NN rule without imposing
the stringent storage requirements of the NN rule.

What are the best results we can possibly obtain
from these procedures? In [2-6], in one way or another,
the authors concluded that the minimum probability of
error of the NN rule is bounded below by the Bayes
probability of error and above by twice the Bayes probability
of error, where the Bayes probability of error
is the minimum probability of error over all decision rules
taking the underlying probability structure into account.
If the density functions f and g corresponding to
F and G are known, the discrimination should depend only
on f(z)/g(z), where z is an observation vector, with the
following rule for some c > 0:

If f(z)/g(z) > c then z ∈ F
If f(z)/g(z) < c then z ∈ G
If f(z)/g(z) = c then the decision may be made in an
arbitrary manner.

This procedure, known as the likelihood ratio procedure
L(c), is known to have optimum properties with regard to
control of probability of misclassification. The two
choices of c suggested are:

1.) Take c = 1.

2.) Choose c so that the probabilities of error
    are equal.

In [1] Fix and Hodges define the idea of consistency
in the sense of performance characteristics, in the sense
of decision function, and with the likelihood ratio. They
also proved the following theorem:

If $\hat{f}(z)$ and $\hat{g}(z)$ are consistent estimates for f(z)
and g(z) for all z except possibly $z \in Z_{f,g}$ where
$P_i(Z_{f,g}) = 0$, i = 1, 2, then $L^*(c, \hat{f}, \hat{g})$ is consistent with
L(c).

Here $L^*(c, \hat{f}, \hat{g})$ is the likelihood ratio procedure applied to the estimated
values $\hat{f}(z)$ and $\hat{g}(z)$ of the density functions f(z), g(z).

The problem now is to find consistent estimates for
f and g. In [1] on pages 13 - 20 two procedures are proposed,
and of the two the second or alternate
procedure is recommended by the authors. This is a quote
of the paragraph on page 20 of [1] in which the authors
explain the alternate procedure.

"Choose k, a positive integer which is large but small
compared to the sample sizes. Specify a metric in the
sample space, for example ordinary Euclidean distance.
Pool the two samples and find, of the k values in the pooled
samples which are nearest to z, the number M which are X's.
Let N = k-M be the number which are Y's. Proceed with the
likelihood ratio discrimination, using however M/m in place
of f(z) and N/n in place of g(z). That is, assign Z to
F if and only if

$$\frac{M}{m} > c\,\frac{N}{n}."$$


If the above procedure is combined with the CNN rule
proposed by P. E. Hart we develop the following algorithm.
Before describing the CNN rule let us define a consistent
subset as a subset of the training set which, when used as
a training set for the NN rule, correctly classifies all
of the remaining points in the training set. A minimal
consistent subset is a consistent subset with the minimum
number of elements. The CNN rule uses the following algorithm
to determine a consistent subset of the original
sample set. It should be noted, however, that this subset
is not necessarily minimal. We assume that the
original sample set is arranged in some order; then we
set up bins called STORE and GRABBAG and proceed as follows.

1.) The first sample is placed in STORE.

2.) The second sample is classified by the NN rule,
    using as a reference set the current contents
    of STORE. If the sample is classified correctly
    it is placed in GRABBAG; otherwise it is placed
    in STORE.

3.) Proceeding inductively, the ith sample is classified
    by the current contents of STORE. If
    classified correctly it is placed in GRABBAG;
    otherwise it is placed in STORE.

4.) After one pass through the original sample set,
    the procedure continues to loop through GRABBAG
    until termination, which can occur in one
    of two ways:

    a.) The GRABBAG is exhausted, with all its
        members now transferred to STORE.

    b.) One complete pass is made through GRABBAG
        with no transfers to STORE.

5.) The final contents of STORE are used as training
    points for the NN rule; the contents of GRABBAG
    are discarded.
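The five steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: samples are (vector, label) pairs, distance is squared Euclidean, ties are broken by storage order, and the function names `nn_label` and `condense` are hypothetical.

```python
def nn_label(x, store):
    """Label of the stored sample nearest to x (squared Euclidean distance)."""
    nearest = min(store, key=lambda s: sum((a - b) ** 2 for a, b in zip(x, s[0])))
    return nearest[1]

def condense(samples):
    """Return STORE, a consistent subset of samples, following steps 1 to 5."""
    store = [samples[0]]                      # step 1
    grabbag = []
    for s in samples[1:]:                     # steps 2 and 3: first pass
        if nn_label(s[0], store) == s[1]:
            grabbag.append(s)
        else:
            store.append(s)
    while grabbag:                            # step 4: loop through GRABBAG
        kept, moved = [], False
        for s in grabbag:
            if nn_label(s[0], store) == s[1]:
                kept.append(s)
            else:
                store.append(s)
                moved = True
        grabbag = kept
        if not moved:                         # termination (b): no transfers in a full pass
            break
    return store                              # step 5: GRABBAG is discarded

# Hypothetical one-dimensional training set with two well separated classes.
samples = [((0.0,), "F"), ((0.1,), "F"), ((0.2,), "F"),
           ((5.0,), "G"), ((5.1,), "G"), ((5.2,), "G")]
store = condense(samples)
```

For this well separated example STORE ends with one sample per class, which shows the storage saving the CNN rule is meant to provide.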

Next we choose a positive odd integer k which is large but
small compared to the sample sizes. With the Euclidean
distance we find the k values in the pooled samples which
are nearest to z. Let M denote the number of samples
belonging to F, and N = k-M be the number of samples belonging
to G. Proceed with the likelihood ratio discrimination,
using however M/m in place of f(z) and N/n in place of g(z).
That is, assign z to F if and only if

$$\frac{M}{m} > c\,\frac{N}{n}.$$
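The decision rule above can be sketched as follows. The function name `knn_ratio_rule`, the labels "F" and "G", and the sample values are illustrative assumptions; the rule itself is the one stated in the text, with M/m and N/n standing in for f(z) and g(z).

```python
def knn_ratio_rule(z, xs, ys, k, c=1.0):
    """Assign z to 'F' if and only if M/m > c * N/n, otherwise to 'G'.

    xs: samples from F (m of them); ys: samples from G (n of them);
    M: number of the k nearest pooled samples that come from xs; N = k - M.
    """
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    pooled = [(dist2(z, x), "F") for x in xs] + [(dist2(z, y), "G") for y in ys]
    pooled.sort(key=lambda t: t[0])           # nearest first
    M = sum(1 for _, label in pooled[:k] if label == "F")
    N = k - M
    return "F" if M / len(xs) > c * N / len(ys) else "G"

# Hypothetical one-dimensional samples.
xs = [(0.0,), (0.1,), (0.2,), (0.3,)]
ys = [(5.0,), (5.1,), (5.2,)]
label_near_F = knn_ratio_rule((0.05,), xs, ys, k=3)
label_near_G = knn_ratio_rule((5.05,), xs, ys, k=3)
```

In the combined procedure described in the text, `xs` and `ys` would be the members of STORE from each class rather than the full training set.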



Some of the advantages of the NN rule are that under
very mild regularity assumptions on the underlying statistics,
for any metric, and for a variety of loss functions, the
large-sample risk incurred is less than twice the Bayes
risk. Moreover, if the populations are not well known,
or have very different covariance matrices, or if the
discrimination is one in which small decreases in probability
of error are not worth extensive computations, then the
k-NN rule with $k \ge 3$ should be used.

Some of the disadvantages of the NN rule are that if
the populations to be discriminated are well known, and have
been investigated to establish that the normal distribution
gives a good fit and that the variances and correlations do
not change much when the means are changed, then better
results can be obtained by the linear discriminant function.
From a practical point of view, however, the NN rule is not
a prime candidate for many applications because of the
storage requirements it imposes. Also, when the CNN
rule is used to find a consistent subset and the Bayes risk is
high, STORE will contain essentially all the points in
the original sample set.



References 


[1] E. Fix and J. L. Hodges, Jr., "Discriminatory analysis,
    nonparametric discrimination," USAF School of Aviation
    Medicine, Randolph Field, Tex., Project 21-49-004,
    Rept. 4, Contract AF41(128)-31, February 1951.

[2] -----, "Discriminatory analysis: small sample performance,"
    USAF School of Aviation Medicine, Randolph Field,
    Tex., Project 21-49-004, Rept. 11, August 1952.

[3] T. M. Cover and P. E. Hart, "Nearest neighbor pattern
    classification," IEEE Trans. Information Theory, vol.
    IT-13, pp. 21-27, January 1967.

[4] P. E. Hart, "The condensed nearest neighbor rule," IEEE
    Trans. Information Theory, pp. 515-516, May 1968.

[5] T. M. Cover, "Estimation by the nearest neighbor rule,"
    IEEE Trans. Information Theory, vol. IT-14, pp. 50-55,
    January 1968.

[6] Terry L. Wagner, "Convergence of the nearest neighbor
    rule," IEEE Trans. Information Theory, vol. IT-17,
    pp. 566-571, September 1971.





DEPARTMENT OF MATHEMATICS 

UNIVERSITY OF HOUSTON HOUSTON, TEXAS 




PREPARED FOR

EARTH OBSERVATION DIVISION, JSC
UNDER 

CONTRACT NAS-9-12777 


3801 CULLEN BLVD. 
HOUSTON, TEXAS 77004 



Computational Forms for the Transformed Covariance
Matrix of Multivariate Normal Population


by 

Mary Ann Roberts 
University of Houston 
Department of Mathematics 


Report #19 


Contract NAS-9-12777 


November 1972 




Let B be a k×n matrix and use the notation ( )* for the conjugate transpose.
In our case the conjugate transpose is simply the transpose, denoted
by ( )^T. The properties of the conjugate transpose used here are:

$B^{**} = B$

$(A + B)^* = A^* + B^*$

$(aB)^* = \bar{a}B^*$ where a is a scalar, $\bar{a}$ its conjugate

$(BA)^* = A^*B^*$

$BB^* = 0 \Rightarrow B = 0$


The following matrix equations will define the generalized inverse of B. Let 
X be an nxk matrix having the properties that: 

BXB = B 
XBX = X 
(XB)* = XB 
(BX)* = BX 


Then X is called the generalized inverse of B, denoted by $X = B^+$. It
can be proved that for any B there is such an X, in fact a unique X. [1]
Some of the properties of $B^+$ are:


$B^{++} = B$

$B^{*+} = B^{+*}$

$BB^+ = I$ if B is k×n of rank k

$BB^+$ and $B^+B$ are each idempotent (XX = X)

$(aB)^+ = a^{-1}B^+$ where a is any nonzero scalar

$(B^*B)^+ = B^+B^{+*}$

If B is normal ($BB^* = B^*B$) then $B^+B = BB^+$ and $(B^n)^+ = (B^+)^n$

$(BB^*)^+BB^* = BB^+$

$B^+ = (B^*B)^+B^* = B^*(BB^*)^+$

$AB = 0 \iff B^+A^+ = 0$

$A^+ = A^{-1}$ if A is non-singular

We are interested in $(B\Sigma B^T)^{-1}$, which exists if we restrict ourselves to a
matrix B which is k×n of rank k. For non-singular matrices $(AB)^{-1} = B^{-1}A^{-1}$,
but unfortunately this result does not hold in general for generalized inverses.
A necessary condition that $(AB)^+ = B^+A^+$ is that $A^+A$ and $BB^+$ commute. A
sufficient condition that the equation hold is that A be of full column
rank and B be of full row rank. The following are necessary and sufficient
conditions that $(AB)^+ = B^+A^+$:

$A^+ABB^*A^* = BB^*A^*$ and $BB^+A^*AB = A^*AB$

$A^+ABB^*$ and $A^*ABB^+$ are hermitian ($X^* = X$)

$A^+ABB^*A^*ABB^+ = BB^*A^*A$

$A^+AB = B(AB)^+AB$ and $BB^+A^* = A^*AB(AB)^+$

Noting the symmetry of $B^+B$ and $BB^+$ we have $B^+B = B^TB^{T+}$ and $B^{T+}B^T = BB^+ = I$.
Thus in our case some matrices for which the reversal rule does hold are:

$(B^TB)^+ = B^+B^{T+}$

$(BB^T)^+ = B^{T+}B^+$

$(\Sigma B)^+ = B^+\Sigma^{-1}$ for non-singular $\Sigma$.

$(\Sigma B^T)^+ = B^{T+}\Sigma^{-1}$ for non-singular $\Sigma$.

$(B\Sigma)^+ = \Sigma^{-1}B^+$ if $\Sigma$ is unitary and B is rank k.

$(B^T\Sigma)^+ = \Sigma^{-1}B^{T+}$ if $\Sigma$ is unitary and B is rank k.



If $\Sigma$ commutes with $B^+B$ then $B\Sigma B^TB^{T+}\Sigma^{-1}B^+ = B\Sigma B^+B\Sigma^{-1}B^+ =
BB^+B\Sigma\Sigma^{-1}B^+ = BB^+BB^+ = I$. Thus in the case of $(B\Sigma B^T)^{-1}$ we have a
sufficient condition for the reversal rule to hold. The question becomes,
how far off is $B^{T+}\Sigma^{-1}B^+$ from $(B\Sigma B^T)^{-1}$? The following theorem is a
useful tool in answering this question:

A necessary and sufficient condition for the equation AXB = C to have
a solution is that

$$AA^+CB^+B = C$$

in which case the general solution is given by

$$X = A^+CB^+ + Y - A^+AYBB^+$$

where Y is an arbitrary matrix of the same dimension as X.
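The solvability condition and the general solution can be verified numerically. The sketch below assumes NumPy, uses `numpy.linalg.pinv` for the generalized inverse, and builds C from a known solution so that the equation is certainly solvable; the matrix sizes and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(4, 2))
X0 = rng.normal(size=(5, 4))
C = A @ X0 @ B                         # constructed so that AXB = C has a solution

Ap = np.linalg.pinv(A)                 # generalized inverse A^+
Bp = np.linalg.pinv(B)                 # generalized inverse B^+

# Solvability test: A A^+ C B^+ B = C
solvable = np.allclose(A @ Ap @ C @ Bp @ B, C)

# General solution for an arbitrary Y of the same shape as X
Y = rng.normal(size=(5, 4))
X = Ap @ C @ Bp + Y - Ap @ A @ Y @ B @ Bp
residual = np.abs(A @ X @ B - C).max()
```

Different choices of `Y` give different solutions `X`, all with residual at the level of rounding error.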

Applying this theorem to the equation $(B\Sigma B^T)^{-1}(B\Sigma B^T) = I$ and using
the preceding facts yields:

(1) $(B\Sigma B^T)^{-1} = B^{T+}\Sigma^{-1}B^+ + B^{T+}\Sigma^{-1}(I - B^+B)Y$ for some Y. [7]

Using the fact that $A^{-1}A = I$ we find that Y must satisfy the equation:

$(B^{T+}\Sigma^{-1}B^+ + B^{T+}\Sigma^{-1}Y - B^{T+}\Sigma^{-1}B^+BY)(B\Sigma B^T) = I$

which simplifies to

(2) $B^{T+}\Sigma^{-1}(I - B^+B)Y(B\Sigma B^T) = I - B^{T+}\Sigma^{-1}B^+B\Sigma B^T$

while, since also $AA^{-1} = I$, Y must satisfy:

$(B\Sigma B^T)(B^{T+}\Sigma^{-1}B^+ + B^{T+}\Sigma^{-1}Y - B^{T+}\Sigma^{-1}B^+BY) = I$

which can be written as

(3) $B\Sigma B^TB^{T+}\Sigma^{-1}(I - B^+B)Y = I - B\Sigma B^TB^{T+}\Sigma^{-1}B^+$.
Applying the same theorem to $(B\Sigma B^T)(B\Sigma B^T)^{-1} = I$ it can be shown that:

(B Z B*^)"^ - b’’^'*’ £ = (B £ b"^)"^ - B £ b'^. 

In the case of divergence we would be satisfied to solve the problem for
$B^+ = B^T$ or even for $B = (I_k, 0)$, where $I_k$ is the k×k identity and 0 is
the k×(n-k) zero matrix, since in [5] it is shown that in the equivalence
class where maximum divergence occurs there is a B such that $B^+ = B^T$, and
from [6] we know that any such B can be written as $B = \tilde{I}U$ where $\tilde{I} = (I_k, 0)$
and U is an n×n unitary matrix.

^1 ^2 


y 

Theorem; Let B = I = (lj^,0), £ = 


matrix, T. 



j. Y = 




\’'V 

\ 5 6 1 



^3/ 


, a positive definite 
where Y^, £^, and are kxk, and 


£ and £ are (n-k) x (n-k), the other matrices being appropriate sizes 
3 6 


so that B and are kxn and £ is nxn. Then Y = satisfies 


(3) above. 

Proof; First note that - (lA . By substitution, the equation 


"i z-hi - n z-H-' 


becomes 


(\.0) 


1^1 


\‘-2 - 3 j 

t l\ 



'£ 


£\ 


k\ (lj,,0 ) 




£^ ^ y 

^2 3/ 




In-^O / (I^ . 0) 


Y = 


71 ^ 

A, 
Completing the multiplication we have


(0, ^5)/ 1 I = Ifc - -^1 ^4 

V 

Since £ ^"^ = I, + ^2 £3 = I and + ^2 ^6 " ° yields 

_ / f Y = ^ 

2 6 2 2 5 


Since 2 is positive definite so also is £ ^ and thus exists, (See 

Appendix 1) Thus ^ = /_£ -1 £ ^ \ satisfies (3). Note that any k x k 


choice for will be acceptable, 


A 

Corollary; If B = lU where U is a unitary matrix and I and £- are 

,-l 


as in the theorem, then Y ® U 


0 \ satisfies (3) where 

A , A T 


Z = U £ U“^ and I “^ = U £ “^u"^ = / ^4 ^5 




Z 


Proof: Since I is rank k and U, unitary, the reversal rule holds and 


I — 1 T 

B = U I . By substitution (3) becomes: 


(lu) £, (u"^i^)(iu) z"^[i-(u"4^)(iu)]y 

= I -(lU) 1 (u"4^)(iu) £"^(u"^i'^) 


Writing I as U~^U, factoring and reassociating we have; 

I(U £ U"^) I^I(U£"V^)[I-i^I)UY 
= I - I(U £ u"^)l'^I(U 



Since $\hat{\Sigma} = U\Sigma U^{-1}$ is a similarity transformation, $\hat{\Sigma}$ is positive
definite if and only if $\Sigma$ is positive definite. Thus $\hat{\Sigma}_6^{-1}$ exists and
the result of the corollary is immediate.

Note that $U\Sigma U^{-1} = U\Sigma U^T$ is the known covariance for the transformation
Y = UX. Thus the problem of finding a B which maximizes divergence can be
treated as a variational problem on U since $\tilde{I}$ is a constant. This may
further simplify the problem since the set of unitary matrices forms a group.


Appendix 1: 

There are several equivalent definitions of a positive definite symmetric 

matrix. The definition used in [8] is: 

A hermitian matrix is said to be positive definite if all its characteristic 
roots are positive. 

From this definition the following theorem is proved [8]. 

A hermitian matrix is positive definite if and only if the determinants 
of all its principal submatrices are positive. 

Using this theorem we will prove the following: . » 

I I \ 


Theorem: 


If I is positive definite where Z = 


1 

T 


where 


is 




k X k, Z ^ ±s 

Proof : Consider 

dimension k x k 
dimension (n-k) 


(n-k) X (n-k) and Z is 

h ' 

K = i where I, 



(n-k) X k 

and I 1 
n-k 


and (n-k) x (n-k) respectively and 
X k. The inverse of the matrix K is 


then exists. 

are identities of 
Z is a zero matrix of 

y^n-k Z ! 
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/ T \ 

/ T \ 

1^2 ^3 

^2 \ 

\^1 ^2 / 

y ^n-k Z y 


>1 f 

3 2 


^2 


\ 

/ 


K Z K ^ is a similarity transformation on Z. so the eigenvalues are 
preserved. Thus since Z is positive definite so also is K Z K Hence 
as Z 2 is a principal submatrix of K Z K ^ by the theorem quoted from [8] 
Z,^ ^ exists since it has positive determinant. 

Corollary; If Z is positive definite and Z 
exists. 


-1 



then 




Proof: If the characteristic roots of Z. are ^ 2 , ..., then the 


characteristic roots of L.~^ are X^'^.X, , ... 


X"^ 
Z ’^2 


,-l 


, Xj^ . Thus if Z is positive 


,-l 


definite, X^ s* 0 for i= 1, ..., k which implies that X^ >0 in^whlch case 
2. is positive definite. HenceZg^ exists by the previous theorem. 
In handling expressions which involve matrix inversion and multiplication,
the following theorems are often useful:

Theorem: If A is a positive definite matrix, there exists a nonsingular
         matrix F such that $FAF^T = I$.

Theorem: If B is positive semidefinite and A is positive definite,
         there exists a nonsingular matrix F such that $FBF^T = D$ and
         $FAF^T = I$, where D is a diagonal matrix whose diagonal elements
         are the roots of the equation $\det(B - \lambda A) = 0$. If
         B is positive definite, then the $\lambda$'s are all greater than
         zero.

The expression for the interclass divergence between two classes is

(1) $D(1,2) = \frac{1}{2}\,\mathrm{tr}\,[(\Lambda_1 - \Lambda_2)(\Lambda_2^{-1} - \Lambda_1^{-1})] + \frac{1}{2}\,\mathrm{tr}\,[(\Lambda_1^{-1} + \Lambda_2^{-1})\,\delta\delta^T]$

where $\Lambda_i$ (i = 1,2) is the covariance matrix for class i and $\delta$ is
the difference between the mean vectors for classes 1 and 2.

The second of the above theorems has been used (2) to simplify (1).
In (1), the covariance matrices are positive definite. However, the
term $\delta\delta^T$ is not. If results such as the two theorems above could be
applied to any of the matrices in (1), the simplifications might be more
useful. To that end we prove the following:

Theorem 1 - If $\delta$ is an n×1 matrix and $\epsilon > 0$, then $\delta\delta^T + \epsilon I$ is
            positive definite.

(1) T. W. Anderson, An Introduction to Multivariate Statistical Analysis
    (New York: John Wiley and Sons, Inc., 1958), pp. 339-341.

(2) C. Chitti Babu, "On the Application of Divergence to Feature Selection
    in Pattern Recognition," IEEE Transactions on Systems, Man, and
    Cybernetics (November 1972), 668-670.



Proof: $\delta\delta^T$ is obviously symmetric and for every n×1 vector x

(2) $x^T\delta\delta^Tx = (x^T\delta)(\delta^Tx) = (\delta^Tx)(\delta^Tx) \ge 0$

The symmetry of $\delta\delta^T + \epsilon I$ is obvious and

(3) $x^T(\delta\delta^T + \epsilon I)x = x^T\delta\delta^Tx + \epsilon x^Tx \ge 0$

The desired result follows from the fact that $\epsilon x^Tx = 0$ if and
only if x = 0.

We will denote the divergence with $\delta\delta^T$ replaced by $\delta\delta^T + \epsilon I$ by
$D_\epsilon(1,2)$.

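Theorem 2 - For $\alpha > 0$, there is an $\epsilon > 0$ such that $|D_\epsilon(1,2) - D(1,2)| < \alpha$.

Proof:

$$|D_\epsilon(1,2) - D(1,2)| = \Big|\tfrac{1}{2}\mathrm{tr}[(\Lambda_1 - \Lambda_2)(\Lambda_2^{-1} - \Lambda_1^{-1})] + \tfrac{1}{2}\mathrm{tr}[(\Lambda_1^{-1} + \Lambda_2^{-1})(\delta\delta^T + \epsilon I)] - \tfrac{1}{2}\mathrm{tr}[(\Lambda_1 - \Lambda_2)(\Lambda_2^{-1} - \Lambda_1^{-1})] - \tfrac{1}{2}\mathrm{tr}[(\Lambda_1^{-1} + \Lambda_2^{-1})\delta\delta^T]\Big|$$

$$= \tfrac{1}{2}\Big|\mathrm{tr}[(\Lambda_1^{-1} + \Lambda_2^{-1})\delta\delta^T] + \mathrm{tr}[(\Lambda_1^{-1} + \Lambda_2^{-1})\epsilon I] - \mathrm{tr}[(\Lambda_1^{-1} + \Lambda_2^{-1})\delta\delta^T]\Big| = \frac{\epsilon}{2}\,\big|\mathrm{tr}(\Lambda_1^{-1} + \Lambda_2^{-1})\big|.$$

Given $\alpha > 0$, choose

$$0 < \epsilon < \frac{2\alpha}{|\mathrm{tr}(\Lambda_1^{-1} + \Lambda_2^{-1})|}$$

and the result follows.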


The usefulness of Theorem 2 is that when considering the divergence
expression D(1,2), it may be replaced by an expression, $D_\epsilon(1,2)$, involving
only positive definite matrices, the numerical value of which differs from
D(1,2) by an arbitrarily small amount.
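Theorems 1 and 2 can be checked on a small example. The sketch below assumes NumPy; the covariance matrices, the mean difference and the value of $\epsilon$ are hypothetical. It confirms that $\delta\delta^T + \epsilon I$ has only positive eigenvalues and that the gap $D_\epsilon(1,2) - D(1,2)$ equals $\frac{\epsilon}{2}\,\mathrm{tr}(\Lambda_1^{-1} + \Lambda_2^{-1})$.

```python
import numpy as np

def divergence(L1, L2, mean_term):
    """Equation (1), with mean_term standing for delta delta^T or its perturbation."""
    L1i, L2i = np.linalg.inv(L1), np.linalg.inv(L2)
    return (0.5 * np.trace((L1 - L2) @ (L2i - L1i))
            + 0.5 * np.trace((L1i + L2i) @ mean_term))

rng = np.random.default_rng(3)
n = 3
G1, G2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
L1 = G1 @ G1.T + n * np.eye(n)          # hypothetical positive definite covariances
L2 = G2 @ G2.T + n * np.eye(n)
delta = rng.normal(size=(n, 1))         # hypothetical mean difference
eps = 1e-3

outer = delta @ delta.T
perturbed = outer + eps * np.eye(n)

D = divergence(L1, L2, outer)
D_eps = divergence(L1, L2, perturbed)
predicted_gap = 0.5 * eps * np.trace(np.linalg.inv(L1) + np.linalg.inv(L2))
smallest_eig = np.linalg.eigvalsh(perturbed).min()
```

Shrinking `eps` shrinks the gap linearly, which is the content of Theorem 2.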
Introduction


The technique development that follows is concerned with selecting from
n-channel multispectral data some k combinations of the n-channels upon
which to base a given classification technique so that some measure of the
loss of the ability to distinguish between classes using the compressed
k-dimensional data is minimized.

In what follows we will assume that we are dealing with the problem of
classifying into one of m distinct n-variate classes (each distributed
according to $N(\mu_i, \Sigma_i)$, i = 1, ..., m) an arbitrary n-channel multispectral
measurement vector x. The classification procedure will be the maximum
likelihood procedure. Information loss in compressing the n-channel data
to k channels will be taken to be the difference in the average interclass
divergences (or probability of misclassification) in n-space and in k-space.
We will assume that data compression will be accomplished by a k×n linear
transformation, i.e., multiplication of the spectral n-vector by a k×n
matrix of rank k. It should be noted that perhaps the only reason (beyond
that of generalizing the idea of "feature selection") for restricting
transformations to be linear transformations of rank k seems to be that of
convenience. The idea of information, divergence and invariance under
transformation of variables (for example as discussed by Kullback [1]) is limited
only to measurable transformations.



B-AVERAGE INTERCLASS DIVERGENCE

Assume the existence of m distinct classes with means and covariances

$\mu_i$ : n-dimensional mean vector for class i.

$\Lambda_i$ : n by n covariance for class i, assumed to
be positive definite.

Let $\delta_{ij} = \mu_i - \mu_j$ so that $\delta_{ij} = -\delta_{ji}$.

The interclass divergence between classes i and j is

$$D(i,j) = \frac{1}{2}\,\mathrm{tr}\{\Lambda_i^{-1}(\Lambda_j + \delta_{ij}\delta_{ij}^T)\} + \frac{1}{2}\,\mathrm{tr}\{\Lambda_j^{-1}(\Lambda_i + \delta_{ij}\delta_{ij}^T)\} - n$$

Note that when $\Lambda_i = \Lambda_j$ and $\mu_i = \mu_j$,

D(i,j) = 0

so that D(i,j) is, in a sense, a measure of the degree of difficulty of
distinguishing between classes i and j, with the larger the value of
D(i,j), the less the degree of difficulty of distinguishing between classes
i and j.

There is a discussion in References [1],[4] of a natural generalization
of the interclass divergence, i.e., the average interclass divergence, defined by
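$$D = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D(i,j) = \frac{1}{2}\,\mathrm{tr}\Big\{\sum_{i=1}^{m}\Lambda_i^{-1}S_i\Big\} - \frac{m(m-1)}{2}\,n$$

where

$$S_i = \sum_{j=1,\; j \ne i}^{m}\big[\Lambda_j + \delta_{ij}\delta_{ij}^T\big]$$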


We are interested in performing the transformation

y = Bx

where

x : an n-dimensional observation vector

B : a k by n matrix of rank k, with $k \le n$

y : the k-dimensional transformed observation vector

It is known [3] that corresponding to the transformation y = Bx,
the means transform,

$\mu_i \rightarrow B\mu_i$

and the covariances transform,

$\Lambda_i \rightarrow B\Lambda_iB^T$


Thus subsequent to performing the transformation y = Bx,
we have m classes with means and covariances

$B\mu_i$ ; k-dimensional mean vector for class i

$B\Lambda_i B^T$ ; k by k covariance for class i (which is positive
definite by the assumptions on B and $\Lambda_i$).


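Thus in k-dimensional space, the B-induced interclass divergence $D_B(i,j)$
is, by definition of the interclass divergence,

$$D_B(i,j) = \tfrac{1}{2}\,\mathrm{tr}\{(B\Lambda_i B^T)^{-1}B(\Lambda_j + \delta_{ij}\delta_{ij}^T)B^T\} + \tfrac{1}{2}\,\mathrm{tr}\{(B\Lambda_j B^T)^{-1}B(\Lambda_i + \delta_{ij}\delta_{ij}^T)B^T\} - k$$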


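Similarly, in k-dimensional space, we can define the B-average interclass
divergence, $D_B$, as

$$D_B = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D_B(i,j) = \tfrac{1}{2}\,\mathrm{tr}\Big\{\sum_{i=1}^{m}(B\Lambda_i B^T)^{-1}(BS_iB^T)\Big\} - \frac{m(m-1)}{2}\,k$$

where, as defined previously,

$$S_i = \sum_{\substack{j=1 \\ j\neq i}}^{m}\big[\Lambda_j + \delta_{ij}\delta_{ij}^T\big]$$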


Note that in performing the transformation y = Bx, the dimension of each
observation is reduced from n to k, so that in a sense, information is lost.
It is shown in Reference [2] that a measure of the information lost is given
by the difference

$$D - D_B \ge 0$$

We are interested in minimizing the information lost, as measured by the
average interclass divergence. Thus, it is desired to maximize the B-average
interclass divergence, or equivalently, minimize $-D_B$.
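The trace form of the B-average interclass divergence lends itself to a short numerical sketch. The code below, in Python with NumPy, evaluates $D_B$ for randomly generated classes and compares it with D (obtained by taking B equal to the identity). The helper names `scatter`, `b_average_divergence` and `make_classes`, and the random test data, are hypothetical and serve only as an illustration under these assumptions.

```python
import numpy as np

def scatter(i, means, covs):
    """S_i = sum over j != i of (Lambda_j + delta_ij delta_ij^T)."""
    n = means[i].shape[0]
    S = np.zeros((n, n))
    for j in range(len(means)):
        if j != i:
            d = means[i] - means[j]
            S += covs[j] + np.outer(d, d)
    return S

def b_average_divergence(B, means, covs):
    """D_B = 1/2 tr{ sum_i (B L_i B^T)^-1 (B S_i B^T) } - m(m-1)k/2."""
    m, k = len(means), B.shape[0]
    total = 0.0
    for i in range(m):
        M = B @ covs[i] @ B.T
        N = B @ scatter(i, means, covs) @ B.T
        total += np.trace(np.linalg.solve(M, N))
    return 0.5 * total - 0.5 * m * (m - 1) * k

def make_classes(m, n, seed=0):
    """Hypothetical test data: random means and positive definite covariances."""
    rng = np.random.default_rng(seed)
    means = [rng.normal(size=n) for _ in range(m)]
    covs = []
    for _ in range(m):
        G = rng.normal(size=(n, n))
        covs.append(G @ G.T + n * np.eye(n))
    return means, covs

means, covs = make_classes(m=4, n=5)
D = b_average_divergence(np.eye(5), means, covs)    # B = I recovers D
B = np.random.default_rng(1).normal(size=(2, 5))    # a rank-2 compression
D_B = b_average_divergence(B, means, covs)
print(D, D_B, D - D_B)                              # the difference should be nonnegative
```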

For p and k integers ($p < k$) it is shown in [1] for measurable
transformations (in general nonlinear) $B_p : E^n \to E^p$ (onto) and $B_k : E^n \to E^k$ (onto)
that $D_{B_p} \le D_{B_k}$. This fact, of course, orders (according to dimension) the
transformed divergence and, thus, one cannot "gain information" by "compressing"
or "reducing" the dimension of the data. It is, under certain conditions,
possible that there is no loss of information in compression, i.e., $D_{B_k} = D$,
in which case we say that $B_k$ is a sufficient (relative to divergence) statistic
[1]. The question of the existence of sufficient statistics has not been resolved
to any workable degree.

In an attempt to analyze the problem of maximizing (if possible) $D_{B_k}$ as a
function of B, we begin by making the following definition.

Definition: If k is an integer and $B_k : E^n \to E^k$ is measurable then $B_k$
will be called a rank-k maximal statistic provided that for every measurable
function $B : E^n \to E^k$, $D_B \le D_{B_k}$.

k k 

In other words a rank-k maximal statistic is a measurable mapping of
$E^n$ onto $E^k$ that makes the transformed divergence as large as possible for a
given compression to a k-dimensional subspace. Note that this concept (as
well as the concept of sufficient statistic) does not depend on linear
transformations. Since the current problem setting is that of multivariate normal
variables we will first examine the multivariate normal case and pursue the
problem in more generality later. The merit of pursuing the nonlinear problem
would be the discovery of conditions under which nonlinear rank-k maximal
statistics are sufficient statistics. Moreover, it is not known whether or not
nonlinear sufficient statistics exist whenever there do not exist linear
sufficient statistics.

We will first determine (in the multivariate normal case) whether or not
there exist linear rank-k maximal statistics for a given $k < n$. Note in
this case, that in the definition the term "rank-k..." can actually be
interpreted as "matrix of rank k" since, for linear transformations, B is
k×n and rank B = k if and only if B maps $E^n$ onto $E^k$.

In what follows we will drop the subscript k on the transformations 
B, unless the meaning of the symbol B is not clearly implied by context. 

Definition: $\mathcal{M}_k$ will denote the set of all k×n matrices of rank k for a
given integer k. We will regard $\mathcal{M}_k$ as a metric (topological) space whose
topology is given by the metric induced by the norm:

$$\|B\| = \|(b_{ij})\| = \sqrt{\mathrm{tr}\,BB^T}$$



First observe that if B is a k×n matrix of rank k (that is, $B \in \mathcal{M}_k$) and B is a rank-k maximal statistic
(i.e., B maximizes $D_B$) then there exists some $\hat{B} \in \mathcal{M}_k$ such that $\hat{B}\hat{B}^T = I$
and $D_{\hat{B}} = D_B$. This follows from the fact that there exists a nonsingular
k×k matrix P such that $(PB)(PB)^T = I$. Noting that divergence is invariant under
nonsingular transformations, $D_{PB} = D_B$, and $\hat{B} = PB$ will satisfy the
required conditions. Again, this says that if there is a B that maximizes
$D_B$ then there is some "normalized B" (i.e., $BB^T = I$) which produces the
same maximum value of $D_B$. In other words the maximum value of $D_B$ is attained
on the set

$$\mathcal{N} = \{B \in \mathcal{M}_k : BB^T = I\}$$

and we may therefore limit our search for the optimum B to the set $\mathcal{N}$.
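The normalization argument can be sketched numerically. In the Python sketch below, the nonsingular matrix P is taken to be $(BB^T)^{-1/2}$, which is one convenient choice and an assumption of this example (the report only asserts that some such P exists). The helper names and the random test data are likewise hypothetical.

```python
import numpy as np

def normalize(B):
    """Return P @ B with P = (B B^T)^(-1/2), so that (PB)(PB)^T = I."""
    w, Q = np.linalg.eigh(B @ B.T)       # B B^T is symmetric positive definite for rank-k B
    P = Q @ np.diag(w ** -0.5) @ Q.T     # nonsingular k x k matrix
    return P @ B

def b_average_divergence(B, means, covs):
    """D_B = 1/2 tr{ sum_i (B L_i B^T)^-1 (B S_i B^T) } - m(m-1)k/2."""
    m, k = len(means), B.shape[0]
    total = 0.0
    for i in range(m):
        S_i = sum(covs[j] + np.outer(means[i] - means[j], means[i] - means[j])
                  for j in range(m) if j != i)
        total += np.trace(np.linalg.solve(B @ covs[i] @ B.T, B @ S_i @ B.T))
    return 0.5 * total - 0.5 * m * (m - 1) * k

# Hypothetical data: m = 3 classes in n = 4 channels, compressed to k = 2.
rng = np.random.default_rng(0)
n, k, m = 4, 2, 3
means = [rng.normal(size=n) for _ in range(m)]
covs = []
for _ in range(m):
    G = rng.normal(size=(n, n))
    covs.append(G @ G.T + n * np.eye(n))

B = rng.normal(size=(k, n))
B_hat = normalize(B)
print(np.allclose(B_hat @ B_hat.T, np.eye(k)))      # B_hat is normalized
print(b_average_divergence(B, means, covs),
      b_average_divergence(B_hat, means, covs))     # the two values should agree
```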




The fact that there actually is at least one B that maximizes $D_B$ is
established as follows. First note that the set $\mathcal{N} = \{B : BB^T = I\}$ is a compact subset of
the space $\mathcal{M}^{k\times n}$ of all real k×n matrices.
Indeed, it is easy to see that $\mathcal{N}$ is a bounded set (with respect to $\|\cdot\|$)
since for $B \in \mathcal{N}$

$$\|B\| = \sqrt{\mathrm{tr}\,BB^T} = \sqrt{\mathrm{tr}\,I} = \sqrt{k}.$$

Moreover $\mathcal{N}$ is a closed set since for any sequence of elements $B_s$ in $\mathcal{N}$
converging to a k×n matrix B we have that $B_sB_s^T = I$ has limit I. On the other hand, matrix multiplication
is a continuous mapping so that $I = \lim B_sB_s^T = (\lim B_s)(\lim B_s^T) = BB^T$ and
hence $B \in \mathcal{N}$. The space $\mathcal{M}^{k\times n}$ is both topologically and algebraically equivalent to
$E^{kn}$ so that viewing $\mathcal{N}$ as a subset of $E^{kn}$ and recalling that closed
and bounded subsets of $E^{kn}$ are compact, we have the desired result.

Now, again, the continuity of matrix multiplication and addition (together
with that of inversion on nonsingular matrices) implies
that $D_B$ is a continuous scalar valued function on the compact set $\mathcal{N} = \{B : BB^T = I\}$ so
that, in addition to being bounded above, $D_B$ must attain its maximum value
at some point of $\mathcal{N}$. This guarantees the existence of a rank-k maximal
statistic and a solution to the problem.

This solution is by no means unique. As in [5] there is at least an
entire equivalence class of matrices B that produce the same maximum divergence.
For example in the equivalence class determined by a given solution B, any
unitary transformation of B, say UB, has the property that $D_{UB} = D_B$ and
$UB(UB)^T = UBB^TU^T = I$ so that there are infinitely many different "normalized"
solutions.

Basically these results allow the search for the optimum B to be
limited to the set of normalized matrices $\{B : BB^T = I\}$ rather than the entire class of k×n matrices of rank k. The
following results restrict the region to be searched even further and give
some geometrical insight into the character of a solution. Keep in mind that
these conditions are eventually going to be used in finding the form of a
B that satisfies the expression for the gradient of $D_B$ with respect to B
that appears in [4].

The following theorem will be useful in effecting the reduction of the 
class of matrices to be searched for the optimum B. 

Theorem: (Singular Value Decomposition) For each real k×n matrix B there
exist unitary matrices V (k×k) and U (n×n) such that

$$B = V\,W\,U$$

where $W = (w_{ij})$ is a k×n matrix such that $w_{ij} = 0$ if $i \ne j$ and
$w_{ij}$ is the nonnegative square root of an eigenvalue of $BB^T$ for i = j.

Corollary: If $BB^T = I$ then for $k < n$

$$B = V(I_k \mid Z)U$$

where $I_k$ is the k×k identity and Z denotes a k×(n-k) matrix of zeros.

Using the corollary and the rank-k maximal statistic B, note that
$V^{-1}B = (I_k \mid Z)U$ and that the $V^{-1}B$-transformed divergence is the
B-transformed divergence, which is therefore the $(I_k \mid Z)U$-transformed divergence, i.e.,

$$D_B = D_{V^{-1}B} = D_{(I_k \mid Z)U}.$$

This says that there exists a unitary matrix U for which the $B = (I_k \mid Z)U$
transformed divergence is maximum. Another way of looking at it is as follows.
"Best" linear combinations of features can be selected by applying, for the
proper choice of unitary matrix U, the transformation

$$\underset{k\times 1}{Y} = \underset{k\times n}{(I_k \mid Z)}\;\underset{n\times n}{U}\;\underset{n\times 1}{X}$$

which amounts to "rotating" or "reflecting" the original coordinates of the
spectral measurement space (i.e., $X \to UX$) then selecting the first k
components of the resulting vector (i.e., $Y = (I_k \mid Z)(UX)$).
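The "rotate, then select the first k components" reading of the corollary can be sketched with NumPy's singular value decomposition. The matrix B below is a hypothetical normalized matrix built from a QR factorization, and the variable names are assumptions of this example. Note that `numpy.linalg.svd` returns the right factor already in the role of U.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 6

# A hypothetical normalized B: orthonormal rows obtained from a QR factorization.
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))   # Q is n x k with orthonormal columns
B = Q.T                                        # so B B^T = I_k

# Full SVD: B = V @ W @ U with V (k x k) and U (n x n) orthogonal.
V, s, U = np.linalg.svd(B, full_matrices=True)
I_Z = np.hstack([np.eye(k), np.zeros((k, n - k))])   # the block matrix (I_k | Z)

x = rng.normal(size=n)
y_direct = V.T @ (B @ x)          # V^{-1} B x
y_rotate_select = (U @ x)[:k]     # rotate/reflect by U, then keep the first k components

print(np.allclose(s, 1.0))                     # singular values of a normalized B are all 1
print(np.allclose(B, V @ I_Z @ U))             # corollary: B = V (I_k | Z) U
print(np.allclose(y_direct, y_rotate_select))  # both routes give the same k-vector
```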

There are several questions related to these results and they are directly
related to the discovery of how they may simplify the calculations of the
gradient of $D_B$ with respect to B.

1. Find the expression for the gradient of $D_{(I_k \mid Z)U}$
with respect to U.

2. Examine decompositions of U (spectral, Householder
transformations, etc.)

3. Relate U to the normalized eigenvectors of the population
covariance matrices.

4. The set of all unitary U forms a compact group in the space of n×n matrices.
Examine the group representation applications.

5. The group in 4 is globally parameterizable. Examine
applications from the theory of Lie groups.



ON THE DERIVATIVE OF THE GENERALIZED INVERSE OF A MATRIX

by

Henry P. Decell, Jr.


A formula for the derivative of the + inverse of a differentiable
matrix A is given wherever that inverse is indeed differentiable.

1. Introduction

It is well known that if A is a nonsingular complex matrix whose entries are
differentiable functions of t, then

$$\frac{dA^{-1}}{dt} = -A^{-1}\,\frac{dA}{dt}\,A^{-1} \qquad (1)$$

$$A\,\frac{dA^{-1}}{dt}\,A = -\frac{dA}{dt} \qquad (2)$$

In the case that A is singular or perhaps even rectangular, Hearon [1]
has given necessary and sufficient conditions that a differentiable such A
have a differentiable generalized inverse. In addition, necessary and
sufficient conditions are given that (1) and (2) remain valid when $A^{-1}$ is
replaced by a differentiable generalized inverse of A. Of course, this
kind of substitution does not always preserve (1) and (2) and it will be the
purpose of this paper to give a general expression for the derivative of
the + inverse of A (whenever that derivative, as well as the derivative
of A, exists).


Henry P. Decell, Jr., Mathematics Department, University of Houston, Houston



(6) imply

$$XA(A^*X^*)^{\cdot}XA = XA(XA)^{\cdot}XA = 0.$$

Hence $XA(\dot{A}^*X^* + A^*\dot{X}^*)XA = 0$ and post multiplication of this expression by
X yields

$$XA\dot{A}^*X^*X = -A^*\dot{X}^*X \qquad \text{(i.e. (7))}.$$

The conjugate transpose of the latter expression is

$$X^*X\dot{A}XA = -X^*\dot{X}A$$

and, of course, holds for any A that is differentiable and has a differentiable
+ inverse. It is clear that $A^*$ satisfies these properties since
$(A^*)^+ = (A^+)^*$ and $(A^*)^{\cdot} = (\dot{A})^*$. It follows that

$$XX^*\dot{A}^*AX = -X\dot{X}^*A^* \qquad \text{(i.e. (8))}$$


Theorem. If A is complex and if A and $A^+$ are differentiable then

$$(A^+)^{\cdot} = -A^+\dot{A}A^+ + (A^+A^{+*}\dot{A}^* + \dot{A}^*A^{+*}A^+) - A^+A(A^+A^{+*}\dot{A}^* + \dot{A}^*A^{+*}A^+)AA^+$$

Proof: Formal differentiation of (4), (5), (6) yields

$$\dot{X} = \dot{X}AX + X\dot{A}X + XA\dot{X} \qquad (4)'$$

$$\dot{X}^*A^* + X^*\dot{A}^* = \dot{A}X + A\dot{X} \qquad (5)'$$

$$\dot{A}^*X^* + A^*\dot{X}^* = \dot{X}A + X\dot{A} \qquad (6)'$$

where X denotes the generalized inverse $A^+$ of A. Moreover, appropriate
multiplications of (6)' and (5)' by X yield

$$\dot{X}AX = -X\dot{A}X + A^*\dot{X}^*X + \dot{A}^*X^*X$$

$$XA\dot{X} = -X\dot{A}X + X\dot{X}^*A^* + XX^*\dot{A}^*$$

so that (4)' implies

$$\dot{X} = \dot{A}^*X^*X + A^*\dot{X}^*X - X\dot{A}X + X\dot{X}^*A^* + XX^*\dot{A}^*$$

and the Corollary implies

$$\dot{X} = -X\dot{A}X - XA\dot{A}^*X^*X + \dot{A}^*X^*X - XX^*\dot{A}^*AX + XX^*\dot{A}^*$$

and since $X = A^+$ we have

$$(A^+)^{\cdot} = -A^+\dot{A}A^+ + (A^+A^{+*}\dot{A}^* + \dot{A}^*A^{+*}A^+) - A^+A(A^+A^{+*}\dot{A}^* + \dot{A}^*A^{+*}A^+)AA^+$$
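The formula of the theorem can be compared with a finite-difference approximation. The Python sketch below does this for a hypothetical matrix function A(t) = A0 + t A1 with random complex entries, which has full column rank near t = 0 so that its + inverse is differentiable there. The function name `pinv_derivative`, the test matrices and the step size are assumptions of this example, and `numpy.linalg.pinv` is used as the + inverse.

```python
import numpy as np

def pinv_derivative(A, dA):
    """Derivative of the + inverse of A, given A and its derivative dA."""
    X = np.linalg.pinv(A)          # X = A^+
    Xh = X.conj().T                # (A^+)^*
    dAh = dA.conj().T              # (dA/dt)^*
    Y = X @ Xh @ dAh + dAh @ Xh @ X
    return -X @ dA @ X + Y - X @ A @ Y @ A @ X

rng = np.random.default_rng(0)
A0 = rng.normal(size=(4, 3)) + 1j * rng.normal(size=(4, 3))
A1 = rng.normal(size=(4, 3)) + 1j * rng.normal(size=(4, 3))

def A(t):
    """Hypothetical differentiable 4 x 3 matrix function."""
    return A0 + t * A1

h = 1e-6
finite_diff = (np.linalg.pinv(A(h)) - np.linalg.pinv(A(-h))) / (2 * h)
formula = pinv_derivative(A(0.0), A1)
print(np.max(np.abs(finite_diff - formula)))   # expected to be small
```

A central difference is used because its truncation error is second order in the step size; any other consistent difference quotient would serve the same purpose.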


4. Concluding Remarks

It is interesting to note that the theorem implies $(A^+)^{\cdot}$ is a solution
of the equation $AZA = -\dot{A}$, which, of course, is analogous to (2). In fact,
we know that when this equation has a solution, all solutions are given by

$$Z = -A^+\dot{A}A^+ + Y - A^+AYAA^+$$

for arbitrary Y having the dimensions of Z [2].

This observation would prompt one to construct the particular Y for
which $Z = (A^+)^{\cdot}$ (whenever $(A^+)^{\cdot}$ exists) if (2) were to be preserved in
some recognizable way. This is, in fact, what was done and, although the
argument of the theorem follows other lines, $Y = A^+A^{+*}\dot{A}^* + \dot{A}^*A^{+*}A^+$.

It would also be interesting to know the significance, if any, of the
expression

$$-A^+\dot{A}A^+ + (A^+A^{+*}\dot{A}^* + \dot{A}^*A^{+*}A^+) - A^+A(A^+A^{+*}\dot{A}^* + \dot{A}^*A^{+*}A^+)AA^+$$

when $\dot{A}$ exists and $(A^+)^{\cdot}$ does not. To write the expression only
requires the existence of $\dot{A}$.

Finally, we have omitted any restatement or generalizations of the
results in [1] since the application of the results herein to [1] seems rather
straightforward.

Introduction


The technique development that follows is concerned with selecting from
n-channel multispectral data some k combinations of the n-channels upon
which to base a given classification technique so that some measure of the
loss of the ability to distinguish between classes using the compressed
k-dimensional data is minimized.

In what follows we will assume that we are dealing with the problem of
classifying into one of m distinct n-variate classes (each distributed
according to $N(\mu_i, \Lambda_i)$, i = 1, ..., m) an arbitrary n-channel multispectral
measurement vector x. The classification procedure will be the maximum
likelihood procedure. Information loss in compressing the n-channel data
to k channels will be taken to be the difference in the average interclass
divergences (or probability of misclassification) in n-space and in k-space.
We will assume that data compression will be accomplished by a k×n linear
transformation, i.e., multiplication of the spectral n-vector by a k×n
matrix of rank k. It should be noted that perhaps the only reason (beyond
that of generalizing the idea of "feature selection") for restricting
transformations to be linear transformations of rank k seems to be that of
convenience. The idea of information, divergence and invariance under
transformation of variables (for example as discussed by Kullback [1]) is limited
only to measurable transformations.

B-AVERAGE INTERCLASS DIVERGENCE

Assume the existence of m distinct classes with means and covariances

$\mu_i$ ; n-dimensional mean vector for class i.

$\Lambda_i$ ; n by n covariance for class i, assumed to
be positive definite.

Let $\delta_{ij} = \mu_i - \mu_j$ so that $\delta_{ij}\delta_{ij}^T = \delta_{ji}\delta_{ji}^T$.

The interclass divergence between classes i and j is

$$D(i,j) = \tfrac{1}{2}\,\mathrm{tr}\{\Lambda_i^{-1}(\Lambda_j + \delta_{ij}\delta_{ij}^T)\} + \tfrac{1}{2}\,\mathrm{tr}\{\Lambda_j^{-1}(\Lambda_i + \delta_{ij}\delta_{ij}^T)\} - n$$

Note that when $\Lambda_i = \Lambda_j$ and $\mu_i = \mu_j$,

$$D(i,j) = 0$$

so that D(i,j) is, in a sense, a measure of the degree of difficulty of
distinguishing between classes i and j, with the larger the value of
D(i,j), the less the degree of difficulty of distinguishing between classes
i and j.

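There is a discussion in References [1], [4] of a natural generalization
of the interclass divergence, i.e., the average interclass divergence, defined by

$$D = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D(i,j) = \tfrac{1}{2}\sum_{i=1}^{m}\sum_{\substack{j=1 \\ j\neq i}}^{m}\mathrm{tr}\{\Lambda_i^{-1}[\Lambda_j + \delta_{ij}\delta_{ij}^T]\} - \frac{m(m-1)}{2}\,n = \tfrac{1}{2}\,\mathrm{tr}\Big\{\sum_{i=1}^{m}\Lambda_i^{-1}S_i\Big\} - \frac{m(m-1)}{2}\,n$$

where

$$S_i = \sum_{\substack{j=1 \\ j\neq i}}^{m}\big[\Lambda_j + \delta_{ij}\delta_{ij}^T\big]$$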


We are interested in performing the transformation

$$y = Bx$$

where

x ; an n-dimensional observation vector

B ; a k by n matrix of rank k, with $k \le n$

y ; the k-dimensional transformed observation vector

It is known [3] that corresponding to the transformation y = Bx,
the means transform as

$$\mu_i \rightarrow B\mu_i$$

and the covariances transform as

$$\Lambda_i \rightarrow B\Lambda_i B^T$$



Thus subsequent to performing the transformation y = Bx,
we have m classes with means and covariances

$B\mu_i$ ; k-dimensional mean vector for class i

$B\Lambda_i B^T$ ; k by k covariance for class i (which is positive
definite by the assumptions on B and $\Lambda_i$).


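Thus in k-dimensional space, the B-induced interclass divergence $D_B(i,j)$
is, by definition of the interclass divergence,

$$D_B(i,j) = \tfrac{1}{2}\,\mathrm{tr}\{(B\Lambda_i B^T)^{-1}B(\Lambda_j + \delta_{ij}\delta_{ij}^T)B^T\} + \tfrac{1}{2}\,\mathrm{tr}\{(B\Lambda_j B^T)^{-1}B(\Lambda_i + \delta_{ij}\delta_{ij}^T)B^T\} - k$$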


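Similarly, in k-dimensional space, we can define the B-average interclass
divergence, $D_B$, as

$$D_B = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D_B(i,j) = \tfrac{1}{2}\,\mathrm{tr}\Big\{\sum_{i=1}^{m}(B\Lambda_i B^T)^{-1}(BS_iB^T)\Big\} - \frac{m(m-1)}{2}\,k$$

where, as defined previously,

$$S_i = \sum_{\substack{j=1 \\ j\neq i}}^{m}\big[\Lambda_j + \delta_{ij}\delta_{ij}^T\big]$$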


Note that in performing the transformation y = Bx, the dimension of each
observation is reduced from n to k, so that in a sense, information is lost.
It is shown in Reference [2] that a measure of the information lost is given
by the difference

$$D - D_B \ge 0$$

We are interested in minimizing the information lost, as measured by the
average interclass divergence. Thus, it is desired to maximize the B-average
interclass divergence, or equivalently, minimize $-D_B$.

It is known that if P is any k×k nonsingular transformation then the
transformed B-average interclass divergence is invariant under the
transformation P (i.e., $D_B = D_{PB}$). $D_B$ is not invariant under singular transformations.

One can define an equivalence relation on the set $\mathcal{B}$ of all k×n (rank k)
matrices as follows. Call $B_1 \sim B_2$ (for $B_1 \in \mathcal{B}$ and $B_2 \in \mathcal{B}$) if and only
if there is some nonsingular k×k matrix P such that $B_1 = PB_2$. It is an
easy task to verify that this relation is reflexive, symmetric and transitive
so that the set $\mathcal{B}$ is partitioned into disjoint equivalence classes whose
union is $\mathcal{B}$. We will denote the set of equivalence classes by $\mathcal{B}/\!\sim$. Note (by
definition of an equivalence class in $\mathcal{B}/\!\sim$) that the value of the divergence
at each representative element of a given equivalence class is constant. This
indicates that if there is a "best" k×n transformation B (in the sense of
maximizing $D_B$) then each element of the equivalence class determined by that
B is also an element of $\mathcal{B}$ that is "best". Note further that each equivalence
class contains infinitely many elements so that if there is a "best" B then
there are infinitely many such B (there may even be more outside of the equivalence
class in question, i.e., distinct equivalence classes may have the same divergence).


This problem is of great importance in actual computation of a "best"
$B \in \mathcal{B}$. The expression for the quantity $D_B$ is nonlinear in B and
iterative schemes that might be used to calculate the "best" B may well
tend to exhibit convergence problems due to the large number of
$B \in \mathcal{B}$ maximizing (or producing a relative extremum of) $D_B$.

Several problems are currently under study:

1. Determine a workable form for the variation of $D_B$
with respect to B.

2. Characterize (by some workable computational means) a
single representative element in each equivalence class,
some one or more of which account for all relative extremums
of $D_B$.

3. Determine the number (or cardinality) of $\mathcal{B}/\!\sim$.

4. Determine some ordering $\le$ on $\mathcal{B}/\!\sim$ (or subset thereof)
for which $\beta_1 \in \mathcal{B}/\!\sim$, $\beta_2 \in \mathcal{B}/\!\sim$ and $\beta_1 \le \beta_2 \Rightarrow D_{B_1} \le D_{B_2}$ for
every $B_1 \in \beta_1$ and $B_2 \in \beta_2$.

5. Determine whether or not $D_B$ actually attains its maximum
value at some (and hence at infinitely many) $B \in \mathcal{B}$.

6. Characterize proper subsets of $\mathcal{B}$ on which $D_B$ attains
its maximum (or relative extremum) value.


Introduction

The technique development that follows is concerned with selecting from
n-channel multispectral data some k combinations of the n-channels upon
which to base a given classification technique so that some measure of the
loss of the ability to distinguish between classes using the compressed
k-dimensional data is minimized.

In what follows we will assume that we are dealing with the problem of
classifying into one of m distinct n-variate classes (each distributed
according to $N(\mu_i, \Lambda_i)$, i = 1, ..., m) an arbitrary n-channel multispectral
measurement vector x. The classification procedure will be the maximum
likelihood procedure. Information loss in compressing the n-channel data
to k channels will be taken to be the difference in the average interclass
divergences (or probability of misclassification) in n-space and in k-space.
We will assume that data compression will be accomplished by a k×n linear
transformation, i.e., multiplication of the spectral n-vector by a k×n
matrix of rank k. It should be noted that perhaps the only reason (beyond
that of generalizing the idea of "feature selection") for restricting
transformations to be linear transformations of rank k seems to be that of
convenience. The idea of information, divergence and invariance under
transformation of variables (for example as discussed by Kullback [1]) is limited
only to measurable transformations.



B-AVERAGE INTERCLASS DIVERGENCE 

Assume the existence of m distinct classes with means and covariances

$\mu_i$ ; n-dimensional mean vector for class i.

$\Lambda_i$ ; n by n covariance for class i, assumed to
be positive definite.

Let $\delta_{ij} = \mu_i - \mu_j$ so that $\delta_{ij}\delta_{ij}^T = \delta_{ji}\delta_{ji}^T$.


The interclass divergence between classes i and j is

$$D(i,j) = \tfrac{1}{2}\,\mathrm{tr}\{\Lambda_i^{-1}(\Lambda_j + \delta_{ij}\delta_{ij}^T)\} + \tfrac{1}{2}\,\mathrm{tr}\{\Lambda_j^{-1}(\Lambda_i + \delta_{ij}\delta_{ij}^T)\} - n$$

Note that when $\Lambda_i = \Lambda_j$ and $\mu_i = \mu_j$,

$$D(i,j) = 0$$

so that D(i,j) is, in a sense, a measure of the degree of difficulty of
distinguishing between classes i and j, with the larger the value of
D(i,j), the less the degree of difficulty of distinguishing between classes
i and j.

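There is a discussion in References [1], [4] of a natural generalization
of the interclass divergence, i.e., the average interclass divergence, defined by

$$D = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D(i,j) = \tfrac{1}{2}\sum_{i=1}^{m}\sum_{\substack{j=1 \\ j\neq i}}^{m}\mathrm{tr}\{\Lambda_i^{-1}[\Lambda_j + \delta_{ij}\delta_{ij}^T]\} - \frac{m(m-1)}{2}\,n = \tfrac{1}{2}\,\mathrm{tr}\Big\{\sum_{i=1}^{m}\Lambda_i^{-1}S_i\Big\} - \frac{m(m-1)}{2}\,n$$

where

$$S_i = \sum_{\substack{j=1 \\ j\neq i}}^{m}\big[\Lambda_j + \delta_{ij}\delta_{ij}^T\big]$$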


We are interested in performing the transformation

$$y = Bx$$

where

x ; an n-dimensional observation vector

B ; a k by n matrix of rank k, with $k \le n$

y ; the k-dimensional transformed observation vector

It is known [3] that corresponding to the transformation y = Bx,
the means transform as

$$\mu_i \rightarrow B\mu_i$$

and the covariances transform as

$$\Lambda_i \rightarrow B\Lambda_i B^T$$



Thus subsequent to performing the transformation y = Bx,
we have m classes with means and covariances

$B\mu_i$ ; k-dimensional mean vector for class i

$B\Lambda_i B^T$ ; k by k covariance for class i (which is positive
definite by the assumptions on B and $\Lambda_i$).


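Thus in k-dimensional space, the B-induced interclass divergence $D_B(i,j)$
is, by definition of the interclass divergence,

$$D_B(i,j) = \tfrac{1}{2}\,\mathrm{tr}\{(B\Lambda_i B^T)^{-1}B(\Lambda_j + \delta_{ij}\delta_{ij}^T)B^T\} + \tfrac{1}{2}\,\mathrm{tr}\{(B\Lambda_j B^T)^{-1}B(\Lambda_i + \delta_{ij}\delta_{ij}^T)B^T\} - k$$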


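Similarly, in k-dimensional space, we can define the B-average interclass
divergence, $D_B$, as

$$D_B = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D_B(i,j) = \tfrac{1}{2}\,\mathrm{tr}\Big\{\sum_{i=1}^{m}(B\Lambda_i B^T)^{-1}(BS_iB^T)\Big\} - \frac{m(m-1)}{2}\,k$$

where, as defined previously,

$$S_i = \sum_{\substack{j=1 \\ j\neq i}}^{m}\big[\Lambda_j + \delta_{ij}\delta_{ij}^T\big]$$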


Note that in performing the transformation y = Bx, the dimension of each
observation is reduced from n to k, so that in a sense, information is lost.
It is shown in Reference [2] that a measure of the information lost is given
by the difference

$$D - D_B \ge 0$$

We are interested in minimizing the information lost, as measured by the
average interclass divergence. Thus, it is desired to maximize the B-average
interclass divergence, or equivalently, minimize $-D_B$.

When the criterion for "feature selection" is based upon the probability
of misclassification for n-variate normal classes $N(\mu_i, \Sigma_i)$, i = 1, ..., m
(here $\Sigma_i$ denotes the covariance of class i),
one encounters the problem (as in the expression for B-average interclass
divergence) of handling an expression of the form $(B\Sigma_i B^T)^{-1}$, i.e., the
inverse of the covariance of the transformed n-variate spectral variables.
This expression appears in each class density in the quadratic form
$(Bx - B\mu_i)^T (B\Sigma_i B^T)^{-1}(Bx - B\mu_i)$, where B is the rank k, k×n matrix to be selected that
minimizes the probability of misclassification. Note that if k = n then
$(B\Sigma_i B^T)^{-1} = (B^T)^{-1}\Sigma_i^{-1}B^{-1}$ and the quadratic form above then remains invariant
under the transformation B.

Since B is rectangular (k×n) and of rank k, we can, at most, generally
guarantee that $B\Sigma_i B^T$ is indeed an invertible k×k matrix. We cannot,
however, hope that the relation between the inverse of $B\Sigma_i B^T$ and the inverse
of $\Sigma_i$ is as simple as that in the case k = n. Indeed, it makes no sense
to talk about the "inverse of B" to start with. It is possible to develop
an expression for the inverse of $B\Sigma_i B^T$ in terms of the generalized inverse
of B and the inverse of $\Sigma_i$.



To this end we will recall the definition of the generalized inverse of
an arbitrary real matrix A, and a theorem applicable to the derivation of
the expression for the inverse of $B\Sigma_i B^T$.

Theorem: (Penrose) [5] For each real matrix A there exists one and only
one matrix X that simultaneously satisfies the four equations

1. $AXA = A$

2. $XAX = X$

3. $(XA)^T = XA$

4. $(AX)^T = AX$

The unique X in this theorem is called the generalized inverse of A and
is denoted $X = A^+$.
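The four Penrose equations are easy to check numerically. The short Python sketch below does so for a hypothetical random 3×5 real matrix, using `numpy.linalg.pinv` as the generalized inverse; the matrix itself is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))      # hypothetical real rectangular matrix
X = np.linalg.pinv(A)            # candidate for A^+

print(np.allclose(A @ X @ A, A))        # 1. A X A = A
print(np.allclose(X @ A @ X, X))        # 2. X A X = X
print(np.allclose((X @ A).T, X @ A))    # 3. (XA)^T = XA
print(np.allclose((A @ X).T, A @ X))    # 4. (AX)^T = AX
```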

Theorem: (Penrose) [5] Any matrix equation $AXB = C$ has a solution X
if and only if

$$AA^+CB^+B = C$$

The general solution (if there are any solutions) is given by

$$X = A^+CB^+ + Y - A^+AYBB^+$$

where Y is any matrix having the dimension of X.

We apply the latter theorem in the following way. 

It is certainly true that $B\Sigma_i B^T$ has an inverse since B has rank
$k < n$ and $\Sigma_i$ has rank n. Hence we must have

$$(B\Sigma_i B^T)(B\Sigma_i B^T)^{-1} = I.$$

This establishes the fact that the matrix equation

$$BX = I$$

has a solution

$$X = \Sigma_i B^T(B\Sigma_i B^T)^{-1}$$

and that (by the second theorem) there must be some Y such that

$$\Sigma_i B^T(B\Sigma_i B^T)^{-1} = B^+ + (I - B^+B)Y$$

or

$$B^T(B\Sigma_i B^T)^{-1} = \Sigma_i^{-1}B^+ + \Sigma_i^{-1}(I - B^+B)Y$$

Now since B is of rank k, it follows that $B^{+T}B^T = BB^+ = I$ so that
multiplying the latter equation by $B^{+T}$ we find that

$$(B\Sigma_i B^T)^{-1} = B^{+T}\Sigma_i^{-1}B^+ + B^{+T}\Sigma_i^{-1}(I - B^+B)Y$$

The problem now is to find out just what Y looks like and to examine
conditions under which Y = Z (the zero matrix) will work.

This problem will be attacked in a later work.
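The expression for the inverse of the transformed covariance in terms of $B^+$ can be explored numerically. The Python sketch below uses one admissible Y, namely $Y = \Sigma B^T(B\Sigma B^T)^{-1}$ (which satisfies $B^+ + (I - B^+B)Y = Y$ because $BY = I$), and also evaluates the candidate obtained with Y = 0. The choice of Y, the random B and the covariance are assumptions of this example, not results from the report.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 2, 4
B = rng.normal(size=(k, n))                 # hypothetical rank-k matrix
G = rng.normal(size=(n, n))
Sigma = G @ G.T + n * np.eye(n)             # hypothetical positive definite covariance

B_plus = np.linalg.pinv(B)                  # generalized inverse B^+
Sigma_inv = np.linalg.inv(Sigma)
M_inv = np.linalg.inv(B @ Sigma @ B.T)      # (B Sigma B^T)^-1 computed directly

X = Sigma @ B.T @ M_inv                     # a solution of B X = I
Y = X                                       # one admissible Y (assumption of this sketch)

rhs = (B_plus.T @ Sigma_inv @ B_plus
       + B_plus.T @ Sigma_inv @ (np.eye(n) - B_plus @ B) @ Y)
rhs_zero_Y = B_plus.T @ Sigma_inv @ B_plus  # the candidate obtained with Y = 0

print(np.allclose(B @ X, np.eye(k)))        # X solves B X = I
print(np.allclose(rhs, M_inv))              # the expression reproduces the inverse
print(np.allclose(rhs_zero_Y, M_inv))       # Y = 0 is not expected to work for a generic Sigma
```

When the covariance is the identity, the Y = 0 candidate does reproduce the inverse, since $B^{+T}B^+ = (BB^T)^{-1}$ for a matrix B of full row rank.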
INTRODUCTION

This paper considers the problem of feature selection or reducing the
dimension of the data to be processed from n to k. By reducing the
dimension of the data from n to k, classification time is generally reduced.
Yet the dimension reduction should not be so great that classification
accuracy is impaired. Thus, consider the general problem of classifying
an n-dimensional observation vector x into one of m distinct classes
$\pi_i$, i = 1, 2, ..., m, where each class is normally distributed with mean $\mu_i$
and covariance $\Lambda_i$, so that we write $\pi_i = \pi_i(\mu_i, \Lambda_i)$. As shown in Reference 1,
the probability of misclassification is minimized if a maximum likelihood
classification procedure is used to classify the data. Thus, the notation
PMC is used to denote this minimal probability of misclassification. The
dimension of each observation vector to be processed can be conveniently
reduced by performing the transformation y = Bx, where B is a k by n
matrix of rank k. Thus, the n-dimensional classification problem transforms
into a k-dimensional classification problem. The problem becomes one of
classifying each k-dimensional observation vector y into one of m distinct
classes $\pi_i$, where now $\pi_i = \pi_i(B\mu_i, B\Lambda_i B^T)$. In this k-dimensional space
determined by the row vectors of B, the minimal probability of
misclassification resulting from applying a maximum likelihood classification procedure is
denoted by $PMC_B$. Since the transformation y = Bx produces a linear
combination of the components of the observation vector x, it can be shown that, in
general, information is lost and

$$PMC_B \ge PMC$$

Thus, for a fixed k, the feature selection problem could be stated as:
select a k×n matrix $\hat{B}$ from the class of all k by n matrices of rank k
such that

$$PMC_{\hat{B}} = \min_{B} PMC_B$$

where $PMC_B$ represents the probability of misclassification resulting from
applying a maximum likelihood classification procedure on the transformed
data Bx.



The problem of evaluating and minimizing $PMC_B$ is handled indirectly.
Let D(i,j) denote the interclass divergence between classes i and j
(Reference 2), as determined using n-dimensional information. Similarly,
let $D_B(i,j)$ represent the interclass divergence between classes i and j
resulting from performing the transformation y = Bx. It is noted that
the interclass divergence is a measure of the "degree of difficulty" of
discriminating between classes $\pi_i$ and $\pi_j$, with, in general, the larger
the interclass divergence, the greater the "separation" between classes
$\pi_i$ and $\pi_j$. Since (Reference 2) it is true that

$$D(i,j) \ge D_B(i,j)$$

it follows that the difference

$$D(i,j) - D_B(i,j) \ge 0$$

can be considered as a measure of the separation to be gained for classes
$\pi_i$ and $\pi_j$. If the average divergence for m classes is defined by

$$D = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D(i,j)$$

it follows that the "B-average divergence", $D_B$, satisfies

$$D_B = \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D_B(i,j) \le \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} D(i,j) = D$$

i.e., that $D_B \le D$ for every k×n matrix B; k = 1,...,n.

We will prove the following theorem.

Theorem: If $D = D_B$, then $PMC_B = PMC$.

These results suggest, for fixed k less than n, that one should select B
so as to maximize $D_B$.

An initial approach to the problem of selecting the "best" k could be to
obtain the "best" B for various values of k less than n. Then select an
"adequate" value of k by computing the difference $D - D_B$, and comparing
D(i,j) with $D_B(i,j)$ for all distinct class pairs, where now, B is assumed
to maximize $D_B$ for a fixed k. The comparison of D(i,j) with $D_B(i,j)$ for
all distinct class pairs will constitute what we will call a "Class
Separability to be Gained Map". For a given pair of classes $\pi_i$ and $\pi_j$, the value
of $D_B(i,j)$ can be considered to represent the separability between classes
$\pi_i$ and $\pi_j$ resulting from the transformation y = Bx. The difference
$D(i,j) - D_B(i,j) \ge 0$ represents the separation to be gained for this class pair.
Thus, we desire to find an integer k (preferably as small as possible) and
corresponding optimal B such that the difference $D(i,j) - D_B(i,j)$ is "small"
for all distinct class pairs.

Tou and Heydorn (Reference 3) proposed a procedure to maximize $D_B(i,j)$,
as a function of B. However, this procedure is valid only in case m = 2,
i.e., the two class problem. Babu (Reference 4) extended the above procedure
to the multi-class problem by proposing a procedure for maximizing $D_B$. Both
procedures amount to computing the gradient of the appropriate function $D_B$
or $D_B(i,j)$ with respect to B. Babu's expression for the gradient of the
average divergence $D_B$ with respect to B is (in addition to being incorrect)
rather lengthy and numerically unattractive since it is expressed in terms
of many eigenvalues and eigenvectors.

In this paper, we derive a simple expression for the gradient of $D_B$
with respect to B. This expression for the gradient is free of any
requirement for computation of eigenvectors or eigenvalues, and, in addition, all
matrix inversions necessary to evaluate the gradient are available from
computing $D_B$. Thus, the feature selection problem becomes one of maximizing
$D_B$ over the class of all k by n matrices of rank k. We will further show that
the maximum value of $D_B$ is attained on the compact set $\mathcal{B} = \{B : BB^T = I\}$ and,
further, that the maximum value of $D_B$ is attained on $\{B \in \mathcal{B} : B = (I_k \mid Z)U$ where U
is an isometry$\}$. Geometrical interpretations of the results will be discussed
as in References 6 & 7.

It will be shown that it is convenient to write $D_B$ as

$$D_B = \tfrac{1}{2}\,\mathrm{tr}\sum_{i=1}^{m}(B\Lambda_i B^T)^{-1}(BS_iB^T) - \frac{k\,m(m-1)}{2}$$

where $S_i$ denotes the positive definite symmetric matrix

$$S_i = \sum_{\substack{j=1 \\ j\neq i}}^{m}\big(\Lambda_j + \delta_{ij}\delta_{ij}^T\big), \qquad \delta_{ij} = \mu_i - \mu_j.$$

We will show that the gradient of $D_B$ with respect to B is

$$\Big[\frac{\partial D_B}{\partial B}\Big]^T = \sum_{i=1}^{m}\big[S_iB^T - \Lambda_iB^T(B\Lambda_iB^T)^{-1}(BS_iB^T)\big](B\Lambda_iB^T)^{-1}$$
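A gradient expression of this kind can be checked against central finite differences. The Python sketch below assumes the entrywise convention that the (a, b) entry of the gradient is the partial derivative of $D_B$ with respect to the (a, b) entry of B, so that the sum above is the transpose of that array. The helper names, the random classes and the step size are hypothetical.

```python
import numpy as np

def scatter(i, means, covs):
    """S_i = sum over j != i of (Lambda_j + delta_ij delta_ij^T)."""
    return sum(covs[j] + np.outer(means[i] - means[j], means[i] - means[j])
               for j in range(len(means)) if j != i)

def b_average_divergence(B, means, covs):
    m, k = len(means), B.shape[0]
    total = sum(np.trace(np.linalg.solve(B @ covs[i] @ B.T,
                                         B @ scatter(i, means, covs) @ B.T))
                for i in range(m))
    return 0.5 * total - 0.5 * k * m * (m - 1)

def gradient(B, means, covs):
    """k x n array whose (a, b) entry is dD_B / dB[a, b]."""
    grad_T = np.zeros((B.shape[1], B.shape[0]))           # n x k
    for i in range(len(means)):
        S_i = scatter(i, means, covs)
        M_inv = np.linalg.inv(B @ covs[i] @ B.T)
        grad_T += (S_i @ B.T - covs[i] @ B.T @ M_inv @ (B @ S_i @ B.T)) @ M_inv
    return grad_T.T

def finite_difference_gradient(B, means, covs, h=1e-6):
    fd = np.zeros_like(B)
    for a in range(B.shape[0]):
        for b in range(B.shape[1]):
            E = np.zeros_like(B)
            E[a, b] = h
            fd[a, b] = (b_average_divergence(B + E, means, covs)
                        - b_average_divergence(B - E, means, covs)) / (2 * h)
    return fd

# Hypothetical data: m = 3 classes, n = 4 channels, k = 2.
rng = np.random.default_rng(0)
n, k, m = 4, 2, 3
means = [rng.normal(size=n) for _ in range(m)]
covs = []
for _ in range(m):
    G = rng.normal(size=(n, n))
    covs.append(G @ G.T + n * np.eye(n))
B = rng.normal(size=(k, n))

g = gradient(B, means, covs)
g_fd = finite_difference_gradient(B, means, covs)
print(np.max(np.abs(g - g_fd)))    # expected to be small
```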


The theoretical development of these techniques was an outgrowth of
University of Houston Mathematics Department Seminars in Pattern Recognition
and Classification Theory. The expression for the gradient of $D_B$ and the
related results appear in References (5-8).

A computer program based on these results was subsequently developed
to maximize $D_B$ for a given k (Reference 9). The program utilizes (in the
iterative solution of the variational equation for B) the Davidon Iterator
(based on the Davidon-Fletcher-Powell technique) generously provided by
Ivan Johnson, Johnson Space Center (Reference 10).

RESULTS 

This section summarizes the results for a 12-dimensional data set
obtained from the C1 flight line. In particular, nine distinct classes
are considered corresponding to soybeans, corn, oats, red-clover, alfalfa,
rye, bare soil, and two distinct classes of wheat. The 12 by 12
covariances and 12 by 1 means for each crop are as defined in Reference 11 and
obtained by actually sampling the C1 flight line data. (Additional
results for different data sets are presented in the paper.) Three
particular cases corresponding to k = 2, 3 and 6 are considered. Let
$\hat{B}_k$ denote that matrix B of rank k which maximizes $D_B$ for a given k less
than n. Then the results for this data set are summarized in Table 1
below:


Table 1.*

    k                                        2       3       6

    Best k channels (exhaustive search)     33.4    45.6    63.0

    D_{B_k}                                 57.1    67.1    72.6

    RATIO                                    .78     .92     .99


In Table 1, D_{B_k} represents the maximum value of D_B for a given k and
is obtained numerically. The term RATIO denotes the ratio D_{B_k}/D, where,
as discussed previously, D ≥ D_B.


Note that when k = 6, this RATIO is .99, the implication being that almost
no information is lost by performing the transformation y = Bx, where
B is a 6 by 12 matrix which maximizes D_B. Since almost no information is
lost, it will be shown that for this B, PMC_B ≈ PMC, so that B also
essentially minimizes the probability of misclassification.


The other values appearing in Table 1, corresponding to selection of the
best k components by exhaustive search, are obtained as follows. Let k be
fixed with n equal to 12, so that each observation vector x constitutes a
tuple

    x = (x_1, x_2, \ldots, x_{12})


* The numbers appearing in Table 1 or discussed in this report are scaled
by a factor of 1/180.
Now by selecting the first k components of every observation vector x
a k-dimensional subspace is generated. Mathematically, selecting the
first k components, for the particular case of k = 3, is equivalent to
performing the operation

    y = \begin{pmatrix}
        1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
        0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
        \end{pmatrix} x = Bx

Thus associated with the selection of the first k components of x is a
corresponding B matrix, so that the B-average divergence D_B can be computed.
This process can be repeated for each distinct set of k components, with 
the total number of distinct sets being the number of combinations of n 
objects taken k at a time. Thus to each distinct set of k components cor- 
responds a distinct matrix B. 

In particular, when k = 6, 924 distinct evaluations of the B-average 
divergence must be performed. For a fixed k, the evaluation of all the 
distinct B-average divergences, corresponding to the number of distinct 
combinations of n elements taken k at a time, constitutes what is called 
an exhaustive search procedure. 
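The exhaustive search procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the procedure's original implementation: the data are synthetic, the form of S_i is assumed, channel indices are 0-based (so the text's "first k components" are indices 0 to k-1), and the function names are hypothetical.

```python
import itertools
import math
import numpy as np

def selection_matrix(channels, n):
    """k by n matrix of zeros and ones that picks out the listed channels."""
    B = np.zeros((len(channels), n))
    for row, ch in enumerate(channels):
        B[row, ch] = 1.0
    return B

def d_b(B, means, covs):
    """B-average divergence, with S_i assumed to be the sum over j != i of
    cov_j plus the outer product of the mean difference."""
    m, k = len(means), B.shape[0]
    total = 0.0
    for i in range(m):
        S_i = sum(covs[j] + np.outer(means[i] - means[j], means[i] - means[j])
                  for j in range(m) if j != i)
        total += np.trace(np.linalg.solve(B @ covs[i] @ B.T, B @ S_i @ B.T))
    return 0.5 * total - k * m * (m - 1) / 2

def exhaustive_search(k, means, covs):
    """Evaluate D_B for every k-channel subset and keep the largest value."""
    n = covs[0].shape[0]
    best_value, best_channels, count = -math.inf, None, 0
    for channels in itertools.combinations(range(n), k):
        value = d_b(selection_matrix(channels, n), means, covs)
        count += 1
        if value > best_value:
            best_value, best_channels = value, channels
    return best_channels, best_value, count

# Synthetic data: 3 classes, 12 channels.
rng = np.random.default_rng(0)
n, m = 12, 3
means = [rng.normal(size=n) for _ in range(m)]
covs = []
for _ in range(m):
    a = rng.normal(size=(n, n))
    covs.append(a @ a.T + n * np.eye(n))

best_channels, best_value, count = exhaustive_search(3, means, covs)
first_three = d_b(selection_matrix((0, 1, 2), n), means, covs)
print("subsets evaluated for k = 3:", count)
print("subsets required for k = 6:", math.comb(12, 6))
print("best channels (0-based):", best_channels, "D_B:", best_value)
```

The number of evaluations grows as the binomial coefficient, which is the source of the 924 evaluations quoted for k = 6 and n = 12.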

Referring back to Table 1, the exhaustive-search value with k = 3 is obtained
by selecting the ninth, eleventh, and twelfth components of each observation
vector and evaluating the resulting B-average interclass divergence.
Evaluating the B-average interclass divergence for all other
distinct three-component combinations is found to result in a smaller
value of the B-average divergence. (Again, it should be recalled that
associated with each distinct three-component combination is a distinct
3 by 12 B matrix.) By repeating the exhaustive search procedure for
k = 2 and k = 6, it is possible to generate the remaining exhaustive-search
values presented in Table 1. Note that for the corresponding values of k,
D_{B_k} is significantly larger than the exhaustive-search value. Also, the
value 67.1 attained by D_{B_3} (when k = 3) is not attained with the
exhaustive search procedure until k = 7,
so that it would take the seven "best" components of each observation
vector to retain information equivalent to that retained by B_3 (as measured
by the average divergence). Recall that the time to classify data is
proportional to n(n+1), so that the time to process the data in the
three-dimensional feature space would be approximately 3/14 the computational
time required to process the 7-dimensional data using the best 7 components
of each observation vector; yet the performance would be approximately
the same, in that similar classification maps would be generated.
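The 3/14 figure follows from the stated proportionality by taking the ratio of the costs at dimension 3 and dimension 7. Here T(n) is an illustrative symbol for the classification time at dimension n, introduced only for this worked step:

```latex
\frac{T(3)}{T(7)} = \frac{3\,(3+1)}{7\,(7+1)} = \frac{12}{56} = \frac{3}{14} \approx 0.21
```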

It is noted that for a given k, the optimal B_k which maximizes D_B
is obtained in less time than is necessary to execute an exhaustive search
procedure. Also, less than three minutes of Univac 1108 computer time
is necessary to obtain B_2, B_3 and B_6, with an average, for any given k, of
about 120 evaluations of D_B and 25 evaluations of ∂D_B/∂B being necessary.

The problem of selecting the best k, namely the smallest integer k
such that adequate class separation is maintained, is handled by
constructing a so-called "Class Separability to be Gained Map," which is
shown in Figure 1. In general, this map compares the k-dimensional interclass
divergence D_B(i,j) with the 12-dimensional interclass divergence D(i,j) for
each distinct i-j pair, where, as shown in Reference 2,

    D(i,j) \geq D_B(i,j)
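The coordinates plotted on such a map can be computed pair by pair. The sketch below assumes the standard divergence between two Gaussian classes (a trace term in the covariances plus a quadratic term in the mean difference); that formula is not restated in this section, so it should be read as an assumption. The data are synthetic and the names divergence and separability_map are hypothetical.

```python
import itertools
import numpy as np

def divergence(mu_i, cov_i, mu_j, cov_j):
    """Divergence between two Gaussian classes (assumed standard form)."""
    inv_i, inv_j = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    d = mu_i - mu_j
    return (0.5 * np.trace((cov_i - cov_j) @ (inv_j - inv_i))
            + 0.5 * d @ (inv_i + inv_j) @ d)

def separability_map(B, means, covs):
    """One row per class pair: (i, j, D(i,j), D_B(i,j), separability to be gained)."""
    rows = []
    for i, j in itertools.combinations(range(len(means)), 2):
        full = divergence(means[i], covs[i], means[j], covs[j])
        reduced = divergence(B @ means[i], B @ covs[i] @ B.T,
                             B @ means[j], B @ covs[j] @ B.T)
        rows.append((i, j, full, reduced, full - reduced))
    return rows

# Synthetic data: 9 classes, 12 channels, an arbitrary 3 by 12 matrix B.
rng = np.random.default_rng(0)
m, n, k = 9, 12, 3
means = [rng.normal(size=n) for _ in range(m)]
covs = []
for _ in range(m):
    a = rng.normal(size=(n, n))
    covs.append(a @ a.T + n * np.eye(n))
B = rng.normal(size=(k, n))

rows = separability_map(B, means, covs)
for i, j, full, reduced, gain in rows[:5]:
    print(f"pair ({i},{j}): D = {full:.3f}, D_B = {reduced:.3f}, to be gained = {gain:.3f}")
```

Each row gives the abscissa D(i,j), the ordinate D_B(i,j), and their difference, which is the vertical distance to the diagonal.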

In particular, Figure 1 compares the three-dimensional feature space
interclass divergence D_{B_3}(i,j) with D(i,j), with the vertical distance
from each point to the solid diagonal line representing the interclass
separability to be gained for each distinct class pair. Thus for a given
i-j pair, its abscissa on the class separability to be gained map is fixed,
and as k is allowed to increase, its ordinate will increase until finally
it attains the diagonal line when k = 12. In an interactive system, by
displaying the class separability to be gained map on a console for a
fixed k, the user could decide if he is satisfied with both the
separability and the separability to be gained for all distinct class pairs.
A critical situation can be assumed to occur when, for a given class pair,
the separability is "small" and the separability to be gained is "large",
or equivalently, when D_{B_k}(i,j) is small and the difference

    D(i,j) - D_{B_k}(i,j)

is large. Such a critical situation could possibly be indicated by the
circled point appearing on Figure 1, which corresponds to the classes
oats and wheat. Such a situation could be handled by increasing k (in
this case from 3 to 4). By re-solving the optimization problem for B_4,
a new class separability to be gained map could be generated and displayed.
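The critical-situation test can be expressed as a simple filter over the class pairs. In the sketch below the thresholds for "small" and "large" are hypothetical, since the text leaves them to the user's judgment, and the divergence values are made-up numbers chosen only to exercise the logic; they are not results from the flight line data.

```python
def critical_pairs(rows, min_separability, min_gain):
    """Return pairs whose reduced divergence is small and whose gain is large.

    rows: iterable of (class_i, class_j, D_ij, D_B_ij).
    min_separability, min_gain: hypothetical user-chosen thresholds.
    """
    return [(i, j) for i, j, full, reduced in rows
            if reduced < min_separability and full - reduced > min_gain]

# Made-up values for three class pairs: (i, j, D(i,j), D_B(i,j)).
rows = [
    (1, 2, 40.0, 5.0),    # small separability, large gain available
    (1, 3, 60.0, 55.0),   # already well separated
    (2, 3, 12.0, 9.0),    # small separability, but little to be gained
]
flagged = critical_pairs(rows, min_separability=10.0, min_gain=20.0)
print(flagged)  # [(1, 2)]
```

A pair such as (2, 3) is not flagged, because increasing k could recover at most the small remaining difference.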

Finally, the symbols a appearing in Figure 1 represent the separa- 
tion between particular class pairs resulting from the "best" three channel 
combination as obtained from the exhaustive search procedure (i.e., 
channels 9, 11, and 12). The increase in class separation for these 
class pairs resulting from B_3 is clearly significant.
[Figure 1. Class Separability to be Gained Map. Axis label: DIVERGENCE D_{B_3}(i,j) IN THREE DIMENSIONAL FEATURE SPACE]